Abstract

We consider the scheduling problem in downlink wireless networks with heterogeneous, Markov-modulated, 
ON/OFF channels. It is well-known that the performance of scheduling over fading channels heavily depends on 
the accuracy of the available Channel State Information (CSI), which is costly to acquire. Thus, we consider the 
CSI acquisition via a practical ARQ-based feedback mechanism whereby channel states are revealed at the end 
of only scheduled users' transmissions. In the assumed presence of temporally-correlated channel evolutions, the
desired scheduler must optimally balance the exploitation-exploration trade-off, whereby it schedules transmissions
both to exploit those channels with up-to-date CSI and to explore the current state of those with outdated CSI.
In earlier works, Whittle's Index Policy had been suggested as a low-complexity and high-performance solution
to this problem. However, analyzing its performance in the typical scenario of statistically heterogeneous channel 
state processes has remained elusive and challenging, mainly because of the highly-coupled and complex dynamics 
it possesses. In this work, we overcome these difficulties to rigorously establish the asymptotic optimality properties 
of Whittle's Index Policy in the limiting regime of many users. More specifically: (1) we prove the local optimality 
of Whittle's Index Policy, provided that the initial state of the system is within a certain neighborhood of a carefully 
selected state; (2) we then establish the global optimality of Whittle's Index Policy under a recurrence assumption 
that is verified numerically for the problem at hand. These results establish, for the first time to the best of our
knowledge, that Whittle's Index Policy possesses analytically provable optimality characteristics for scheduling
over heterogeneous and temporally-correlated channels.

I. Introduction

Channel fluctuation is an intrinsic characteristic of wireless communications. Such variation calls for allocation
of the wireless resources in a dynamic manner, leading to the classic opportunistic scheduling principle (e.g.,
[1], [2]). Under the assumption that the instantaneous channel state information (CSI) is fully available to the
scheduler, many efficient opportunistic scheduling algorithms (e.g., [3]-[5]) have been proposed and extensively
studied.

More recent works have focused on designing scheduling algorithms under imperfect CSI, where the channel
state is modeled as independent and identically distributed (i.i.d.) processes across time (e.g., [6], [7]).
Although the i.i.d. channel model brings ease of analysis, it fails to capture the time-correlation of
the fading channels. Specifically, it fails to exploit the channel memory, which is a critical resource for
making scheduling decisions. However, designing efficient scheduling schemes under time-correlated channels
with imperfect CSI is a very challenging problem. The challenge is mainly because of the difficulty in making
the classic 'exploitation versus exploration' trade-off, in which a scheduler needs to strike a balance between
selecting the channels with up-to-date channel memory that guarantee high immediate gains and exploring the
channels with outdated CSI for more informed decisions and associated future throughput gains.

We consider the downlink scheduling problem where a base station transmits to the users within its transmission 
range, subject to scheduling constraints. To model the time correlations present over fading channels, we assume 
that wireless channels evolve as Markov-modulated ON/OFF processes. The channel state information is obtained 
from ARQ-based feedback, only after each scheduled transmission. Nevertheless, due to time correlation, the
memory of the past channel state can be used to predict the current channel state prior to the scheduling decision.
Hence, channel memory should be intelligently exploited by the scheduler in order to achieve high throughput
performance.

In a related work, a similar problem is considered under delayed CSI, where it is assumed that perfect CSI is
available within a maximum delay, which is in turn smaller than the delay experienced by the ARQ feedback used 
for collision detection. These assumptions allow the scheduling decisions to be decoupled from CSI acquisition, 
which leads to the development of centralized as well as distributed schedulers. However, this approach does 
not use ARQ as a means of acquiring improved channel quality information. In contrast, in our setup the nature 
of ARQ feedback creates an implicit impact of scheduling decisions on the CSI feedback, which completely 
transforms the nature of the optimal scheduler design, and therefore requires a different approach. Under the 
scenario where all the channels have identical Markov statistics, round-robin-based algorithms (e.g., [10]-[12])
have been shown to possess optimality properties in throughput performance. However, the round-robin-based 
algorithms are no longer optimal in asymmetric scenarios, e.g., when different channels have different Markov 
transition statistics, as is naturally the case in typical heterogeneous conditions. 

Under the asymmetric scenarios, our downlink scheduling problem is an example of the classic Restless Multi-armed
Bandit Problem (RMBP) [13]. Low-complexity Whittle's Index Policies [13] for the downlink scheduling
problem have been proposed in [14], [15] based on RMBP theory. However, although Whittle's Index Policy
can bring significant throughput gains by exploiting the channel memory [15], the analytical characterization
of its performance under asymmetric scenarios is very challenging and highly technical. This is because
asymmetry leads to a sophisticated interplay of memory evolution among channels with heterogeneous characteristics,
which brings a significant challenge to the analysis of Whittle's Index Policy not present in the perfectly
symmetric scenario.

For RMBP problems under general scenarios, Whittle's Index Policy has been proven in [16] to be asymptotically
optimal as the number of users grows, provided a non-trivial condition, known as Weber's condition, holds.
Nonetheless, Weber's condition concerns the global convergence of a non-linear differential equation, which is 
extremely difficult to verify even numerically in our downlink scheduling scenario. 

In this paper, we take significant steps in analyzing the optimality properties of Whittle's Index Policy for
the downlink scheduling problem in the presence of channel heterogeneity. Specifically, our contributions are as
follows.

• We apply the Whittle's index framework to our downlink scheduling problem and identify the optimal policy
for the problem with a relaxed constraint in Section III. This policy, with carefully selected randomization,
provides a performance upper bound to Whittle's Index Policy.

• We establish the local optimality of Whittle's Index Policy in the asymptotic regime when the number of
users scales in Section V. Specifically, we show that the performance of the index policy can get arbitrarily
close to that of the relaxed-constraint optimal policy, provided that the initial state of the system is within
a certain neighborhood of a carefully selected state.

• Based on the local optimality result, under a numerically verifiable recurrence assumption, we then establish
the global optimality of Whittle's Index Policy in the limiting regime of many users in Section VI.

To the best of our knowledge, our work is the first to give analytical characterization of Whittle's Index Policy 
for downlink scheduling under channel heterogeneity. 

II. System Model and Problem Formulation 
A. Downlink Wireless Channel Model 

We consider a time-slotted, wireless downlink system with one base station and $N$ users. The wireless channel
$C_i[t]$ between the base station and user $i$ remains static within each time slot $t$ and evolves stochastically across time
slots, independently across users. We adopt the simplest non-trivial model of time-correlated fading channels by
considering two-state ON/OFF channels, where the state space of channel $i$ is $S_i = \{0, 1\}$, with the value of
each state representing the transmission rate the channel can support in that state.

One important component of our model is the inclusion of channel heterogeneity that the users will typically 
experience in real systems. Such asymmetry creates a significant challenge to the design and analysis of optimal 
scheduling schemes compared to perfectly symmetric channels. To avoid cumbersome notation and unessential 
technical complications, in this work we model channel asymmetry by considering only two classes of channel 
statistics. Specifically, for all the channels in class $k$, $k=1,2$, their states evolve according to the same Markov
statistics. However, these characteristics differ between classes. The state transition of channels in class $k$ is
depicted in Fig. 1 and is represented by a $2 \times 2$ probability transition matrix,

\[
\begin{bmatrix} p_k & 1-p_k \\ r_k & 1-r_k \end{bmatrix},
\]

where

\[
p_k := \mathrm{prob}(C_i[t]=1 \mid C_i[t-1]=1), \qquad
r_k := \mathrm{prob}(C_i[t]=1 \mid C_i[t-1]=0),
\]

for channel $i$ in class $k$. The number of class $k$ channels is $\gamma_k N$, $k \in \{1,2\}$, with $\gamma_k$ being the proportion of
channels in class $k$ with respect to the total number $N$ of channels.

Fig. 1: Two state Markov chain model for channels in class k.

We study the scenario where all the Markovian channels are positively correlated, i.e., $p_k > r_k$ for $k=1,2$.
This assumption, which is commonly made in this domain (e.g., [12], [17]), means that the channel evolution has
a positive auto-correlation. Hence, roughly speaking, the channel is more likely to stay in its previous
state than to jump to the other, which is typical especially in slow fading environments. For ease of exposition, we
shall exclude the trivial case when $r_k = 0$ or $p_k = 1$, $k = 1, 2$.
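
As an illustrative sketch of the channel model above (not part of the original formulation), a single ON/OFF channel with transition probabilities $p_k$ and $r_k$ can be simulated slot by slot. The function name `simulate_channel`, its arguments, and the fixed seed are assumptions made only for this example.

```python
import random


def simulate_channel(p, r, num_slots, initial_state=1, seed=0):
    """Simulate a two-state ON/OFF Markov channel.

    p = prob(ON at t | ON at t-1), r = prob(ON at t | OFF at t-1).
    Returns the list of channel states (1 = ON, 0 = OFF), one per slot.
    The seed is fixed only to make this illustration reproducible.
    """
    rng = random.Random(seed)
    state = initial_state
    trajectory = []
    for _ in range(num_slots):
        # Probability of being ON in this slot depends on the previous state.
        prob_on = p if state == 1 else r
        state = 1 if rng.random() < prob_on else 0
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    states = simulate_channel(p=0.8, r=0.2, num_slots=20000)
    print("empirical ON fraction:", sum(states) / len(states))
```

With $p_k > r_k$, long runs of identical states appear in the trajectory, which is the channel memory that the scheduler is meant to exploit.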

B. Scheduling Model — Belief Value Evolution 

We assume that the base station can simultaneously transmit to at most $\alpha N \in \mathbb{Z}^+$ users in a time slot without
interference, where $\alpha \in (0, 1]$ stands for the maximum fraction of users that can be activated. For example, in a
multi-channel communication model, $\alpha$ would correspond to the fraction of all users that can be simultaneously
served in unit time. However, the scheduler does not know the exact channel state in the current slot when the
scheduling decision is made. Instead, the scheduler maintains a belief value $\pi_i[t]$ for each channel $i$, which is
defined as the probability of channel $i$ being in the ON state at the beginning of slot $t$. The accurate channel state
is revealed via ACK/NACK feedback from the scheduled users, only at the end of each time slot after the data
is transmitted. This accurate channel state feedback is in turn used by the scheduler to update the belief values.

For user $i$ in class $k$, $k=1,2$, let $a_i[t] \in \{0, 1\}$ indicate whether the user is selected for transmission in slot $t$.
Then, from the definition of the belief values, $\pi_i[t]$ evolves as follows:

\[
\pi_i[t+1] =
\begin{cases}
p_k, & \text{if } a_i[t]=1,\ C_i[t]=1, \\
r_k, & \text{if } a_i[t]=1,\ C_i[t]=0, \\
\pi_i[t]\,p_k + (1-\pi_i[t])\,r_k, & \text{if } a_i[t]=0.
\end{cases}
\tag{1}
\]

In our setup, belief values are known to be sufficient statistics to represent the past scheduling decisions and
feedback (e.g., [17], [18]). Meanwhile, in our ON/OFF channel model, $\pi_i[t]$ also equals the expected
throughput contributed by channel $i$ if it is scheduled in time slot $t$.
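
The belief update rule (1) can be sketched as a small function. This is a minimal illustration under the stated model; the name `update_belief` and its argument layout are assumptions of this example, not notation from the paper.

```python
def update_belief(belief, p, r, scheduled, observed_state=None):
    """One-step belief update for a single channel, following rule (1).

    belief: current probability that the channel is ON.
    p, r: transition probabilities prob(ON | ON) and prob(ON | OFF).
    scheduled: True if the user was selected in this slot.
    observed_state: channel state (0 or 1) revealed by ACK/NACK feedback;
        required only when scheduled is True.
    """
    if scheduled:
        if observed_state not in (0, 1):
            raise ValueError("observed_state must be 0 or 1 when scheduled")
        # Feedback reveals the exact state, so the next belief is p or r.
        return p if observed_state == 1 else r
    # No feedback: propagate the belief one step through the Markov chain.
    return belief * p + (1.0 - belief) * r


if __name__ == "__main__":
    print(update_belief(0.5, 0.8, 0.2, scheduled=True, observed_state=1))
    print(update_belief(0.8, 0.8, 0.2, scheduled=False))
```

Note that a scheduled channel's next belief does not depend on its current belief, which is why scheduling acts as a reset of the channel memory.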

For a user in class $k$, $k=1,2$, we use $b^k_{c,l}$ to denote its belief value when the most recently observed channel state
was $c \in \{0, 1\}$, and the observation is $l$ slots in the past. From the belief update rule (1), $b^k_{c,l}$ can be calculated as a function of
$l \geq 1$ as

\[
b^k_{0,l} = \frac{r_k - (p_k - r_k)^l\, r_k}{1 + r_k - p_k}, \qquad
b^k_{1,l} = \frac{r_k + (1 - p_k)(p_k - r_k)^l}{1 + r_k - p_k}.
\]

Fig. 2 illustrates the belief value update when a channel stays idle (i.e., $a_i=0$). It is clear that if the scheduler
is never informed of the state of channel $i$ (in class $k$), the belief value will converge to its stationary probability
of being ON, denoted by the stationary belief value $b^k_s := r_k/(1+r_k-p_k)$.

Fig. 2: Belief values update when staying idle, $p_k = 0.8$, $r_k = 0.2$ (horizontal axis: time of staying idle, $l$).

The vector $\pi[t]=(\pi_1[t], \cdots, \pi_N[t])$ denotes the belief values of all channels at the beginning of slot $t$. We use
$B_k$ to represent the set of the belief values for class $k$ channels, where $B_k=\{b^k_s,\ b^k_{c,l} : c \in \{0, 1\},\ l \in \mathbb{Z}^+\}$. We assume
that the system starts to operate from slot $t = 0$. At the beginning of slot 0, for each channel the scheduler has
either observed its channel state before, or has never been informed of its channel state, i.e., with belief value
$b^k_s$. It is then clear that, based on the belief update rule (1), $\pi_i[t] \in B_k$ for all $t \geq 0$, i.e., each belief value $\pi_i[t]$
evolves over countably many states.

In the rest of the paper, we shall use 'belief value' and 'belief state' interchangeably. 
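
As a sanity check on the idle-time belief values, the following sketch compares a closed form for $b^k_{c,l}$ with direct iteration of the update rule (1). The closed form used here is the reconstruction $b^k_{0,l} = (r_k - (p_k-r_k)^l r_k)/(1+r_k-p_k)$ and $b^k_{1,l} = (r_k + (1-p_k)(p_k-r_k)^l)/(1+r_k-p_k)$; the function names are assumptions of this example.

```python
def belief_closed_form(p, r, last_state, idle_slots):
    """Belief when the last observation (last_state) is idle_slots >= 1 slots old."""
    decay = (p - r) ** idle_slots
    if last_state == 0:
        return (r - decay * r) / (1.0 + r - p)
    return (r + (1.0 - p) * decay) / (1.0 + r - p)


def belief_by_iteration(p, r, last_state, idle_slots):
    """Same quantity, obtained by iterating the idle branch of the update rule."""
    belief = p if last_state == 1 else r  # belief one slot after the observation
    for _ in range(idle_slots - 1):
        belief = belief * p + (1.0 - belief) * r
    return belief


def stationary_belief(p, r):
    """Stationary probability of ON, i.e. the limit of both sequences."""
    return r / (1.0 + r - p)


if __name__ == "__main__":
    p, r = 0.8, 0.2
    for l in (1, 2, 5, 20):
        print(l, belief_closed_form(p, r, 1, l), belief_closed_form(p, r, 0, l))
    print("stationary:", stationary_belief(p, r))
```

For positively correlated channels the two sequences approach the stationary belief monotonically from above (last state ON) and from below (last state OFF).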



C. Downlink Scheduling Problem - POMDP Formulation 

We consider the broad class $U$ of (possibly non-stationary) scheduling policies that make scheduling decisions
based on the history of observed channel states and scheduling actions. The downlink scheduling problem is then
to identify a policy in $U$ that maximizes the infinite horizon, time average expected throughput, subject to the
constraint on the number of users selected at each time slot. Given the initial state $\pi[0]$, the problem is formulated
as

\[
\max_{u \in U}\ \liminf_{T \to \infty}\ \frac{1}{T}\, E\Big[ \sum_{t=0}^{T-1} \sum_{i=1}^{N} \pi_i[t] \cdot a_i^u[t] \,\Big|\, \pi[0] \Big]
\tag{2}
\]

\[
\text{s.t.}\quad \sum_{i=1}^{N} a_i^u[t] \leq \alpha N, \quad \forall t,
\tag{3}
\]

where the belief value $\pi_i[t]$ evolves according to rule (1) based on the scheduling decision $a_i^u[t]$ under policy $u$.
Such an objective is standard in the literature for Markov Decision Processes under the long term average reward
criterion (e.g., [19]). Since the scheduling decisions are made based on incomplete knowledge of
channel states, this problem is a Partially Observable Markov Decision Process (POMDP) [18].

This problem is in fact an example of a Restless Multi-armed Bandit Problem (RMBP) [13]. For a general RMBP,
finding an optimal solution is PSPACE-hard [20]. However, for the downlink scheduling problem at hand, a low-complexity
Whittle's Index Policy was proposed in [14], [15] based on the RMBP theory that inherently exploits
the channel memory when making scheduling decisions. For detailed descriptions of general RMBP and Whittle's
Index Policy for downlink scheduling, please refer to [13]-[15].

For the downlink scheduling problem, we note that there is only limited analytical characterization of Whittle's
Index Policy, which is restricted to perfectly symmetric scenarios where Whittle's Index Policy takes a special
round-robin form [14]. In asymmetric cases, however, the scheduling decision no longer takes a round-robin form,
and the belief value evolutions become tightly coupled among channels,
which significantly complicates the analysis. The main focus of this paper is to analytically characterize the
performance of Whittle's Index Policy in the asymmetric case with two classes of channels.



III. Upper Bound on Achievable Throughput 

We begin our analysis by characterizing an upper bound to the throughput performance of all feasible downlink
scheduling policies that satisfy the constraint (3). The upper bound is obtained from a fictitious policy which
is optimal for the downlink scheduling problem under a relaxed constraint.

Note here that such relaxation is also a crucial step in the study of the general RMBP problem. Yet, our
analysis, being specific to the downlink scheduling problem, has its novelties, as we shall remark on later.

A. Average-Constrained Relaxed Scheduling Problem

We consider an associated relaxed problem of (2)-(3) that only requires an average number of users to be
activated in the long run, defined as follows:

\[
\max_{u \in U}\ \liminf_{T \to \infty}\ \frac{1}{T}\, E\Big[ \sum_{t=0}^{T-1} \sum_{i=1}^{N} \pi_i[t] \cdot a_i^u[t] \,\Big|\, \pi[0] \Big]
\tag{4}
\]

\[
\text{s.t.}\quad \limsup_{T \to \infty}\ \frac{1}{T}\, E\Big[ \sum_{t=0}^{T-1} \sum_{i=1}^{N} a_i^u[t] \,\Big|\, \pi[0] \Big] \leq \alpha N.
\tag{5}
\]

Note that, contrary to the stringent constraint (3), the relaxed constraint (5) allows the activation of more than
an $\alpha$ fraction of users in each time slot, provided the long term average fraction does not exceed $\alpha$. Hence the
optimal policy under this relaxed constraint, which we shall identify next, provides a throughput upper bound to
any policy that satisfies the stringent constraint.
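
The difference between the two constraints can be illustrated on a finite schedule. The sketch below is a finite-horizon, sample-path analogue only: the actual relaxed constraint is stated in terms of a long-run expected average, and the function names and the list-of-lists schedule format are assumptions of this example.

```python
def satisfies_strict_constraint(schedule, alpha):
    """True if at most alpha * N users are active in every slot.

    schedule: list of slots, each a list of 0/1 activation decisions, one per user.
    """
    return all(sum(slot) <= alpha * len(slot) for slot in schedule)


def satisfies_average_constraint(schedule, alpha):
    """True if the time-averaged number of active users is at most alpha * N."""
    num_slots = len(schedule)
    num_users = len(schedule[0])
    average_active = sum(sum(slot) for slot in schedule) / num_slots
    return average_active <= alpha * num_users


if __name__ == "__main__":
    # Four users, alpha = 0.5: three users active in slot 0, one in slot 1.
    schedule = [[1, 1, 1, 0], [0, 0, 0, 1]]
    print(satisfies_strict_constraint(schedule, 0.5))   # per-slot check
    print(satisfies_average_constraint(schedule, 0.5))  # averaged check
```

Every schedule that passes the per-slot check also passes the averaged check, which is the reason the relaxed optimum upper-bounds the strictly constrained one.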



B. Optimal Policy for the Relaxed Problem 

We remark that the relaxed problem is also an important component of Whittle's analysis of general RMBPs
[13], in which an optimal policy for the relaxed problem is developed based on the Whittle's index values.
Following the approach of the classic RMBP framework [13], in our downlink scenario, we identify an optimal
policy for the relaxed problem based on Whittle's indices.

Specifically, for channels in class $k$, the Whittle's index value $W_k(\pi)$ is assigned to each belief state $\pi \in B_k$.
These index values intuitively capture the exploitation and exploration value to be gained from scheduling the
associated channel when its belief value is $\pi$. This characteristic of $W_k(\pi)$ is also illustrated in Section VII-B
via numerical investigations. While these index value functions have been expressed in closed form in various
cases (see [14], [15]), the following two characteristics they possess are primarily significant for our analysis:

• $W_k(\pi)$ monotonically increases with $\pi \in B_k$.

• $W_k(\pi) \in [0,1]$ for all $\pi \in B_k$.

In the next proposition, we identify an index-based policy with appropriate randomization that is optimal for 
the relaxed constraint problem. This policy schedules each user based on its own belief value, independently 
from other users. 

Proposition 1. For the problem under the relaxed constraint, there exists an optimal stationary policy $\phi^*$, parameterized
by the threshold $\omega^*$ and a randomization parameter $\rho^* \in (0, 1]$, such that

(i) Channel $i$ in class $k$ is scheduled if $W_k(\pi_i[t]) > \omega^*$, and stays idle if $W_k(\pi_i[t]) < \omega^*$. If $W_k(\pi_i[t]) = \omega^*$, it is
scheduled with probability $\rho^*$.

(ii) The parameters $\omega^*$ and $\rho^*$ are such that, under policy $\phi^*$, the relaxed constraint (5) is satisfied with
equality.

Proof: The proof of the proposition builds on the RMBP theory [13], [14] along with optimization techniques. Details
of the proof are given in Appendix A. ■
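
The per-user decision rule in part (i) of Proposition 1 can be sketched as follows. The index value is passed in as a number; computing Whittle's index itself is outside the scope of this sketch, and the names `relaxed_policy_decision`, `threshold`, and `rho` are assumptions of this example standing in for $\omega^*$ and $\rho^*$.

```python
import random


def relaxed_policy_decision(index_value, threshold, rho, rng):
    """Threshold rule with randomized tie-breaking.

    Returns 1 (schedule) if index_value > threshold, 0 (idle) if it is
    smaller, and 1 with probability rho when the two are equal.
    rng is a random.Random instance supplying the randomization.
    """
    if index_value > threshold:
        return 1
    if index_value < threshold:
        return 0
    return 1 if rng.random() < rho else 0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical index values for three users, threshold 0.5, rho 0.3.
    for value in (0.7, 0.5, 0.2):
        print(value, relaxed_policy_decision(value, 0.5, 0.3, rng))
```

Each user is handled independently of the others here, which mirrors the decoupled structure of the relaxed problem.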

From now on, we shall refer to $\phi^*$ as the 'Optimal Relaxed Policy'. For technical purposes, we henceforth assume
$\alpha$ is such that $\rho^* \neq 1$. Since each $\alpha$ value maps to a unique $(\omega^*, \rho^*)$ pair (see Appendix A), only countably many
$\alpha$ values correspond to $\rho^*=1$, i.e., are achieved by deterministic policies. Therefore, the set of $\alpha \in (0, 1]$ for which
$\rho^* \neq 1$ has Lebesgue measure one.
Remarks: 

1) Our work is the first to identify the specific form of the optimal policy for the relaxed problem in downlink
scheduling. We identify in Proposition 1 that appropriate randomization is essential to guaranteeing optimality.
The randomization is important, because deterministic policies are insufficient to guarantee optimality for
general constrained Markov Decision Processes when both the reward and constraint are in the expected average
form ifTOl . and thus unable to provide a throughput upper bound. 

2) Our objective function takes a very general form: it is not restricted to the family of stationary policies, nor
does it require the existence of the limit (i.e., $\liminf_{T\to\infty} \frac{1}{T}E[\cdot] = \lim_{T\to\infty} \frac{1}{T}E[\cdot]$ in (4) and (5)), whereas the existence
of limits (with different forms) is assumed in previous literature [13], [14] on Whittle's Index Policy. Such an
extension not only requires a non-trivial amount of technical work, but is also important for proving optimality of
the stationary Optimal Relaxed Policy over a larger space of possibly non-stationary control strategies.

C. Steady State Distribution of Belief Values 

We next present the transition structure of the belief values under the Optimal Relaxed Policy, captured in the
following lemma. The structure will be critical in the development of our subsequent main results.

Lemma 1. For each channel in class $k$, under the Optimal Relaxed Policy, the structure of belief value evolution
depends on the threshold $\omega^*$ of the policy.

(i) If $\omega^* < W_k(b^k_s)$, then the belief value evolution of each class $k$ channel is positive recurrent with a finite
recurrent class.

(ii) If $\omega^* > W_k(b^k_s)$, the belief value evolution is transient. With probability 1, ultimately no channel in class $k$
will transmit.

Proof: The proof of this lemma follows from the monotonic structure of belief evolution, as shown in Fig. 2.
Details are included in Appendix B. ■

Thus, if $\omega^* > \max\{W_1(b^1_s), W_2(b^2_s)\}$, the above analysis reveals that ultimately no user transmits, corresponding
to the trivial case of $\alpha N = 0$. Also, if $\omega^*$ is between $W_1(b^1_s)$ and $W_2(b^2_s)$, the class with the smaller $W_k(b^k_s)$ will
eventually transit into a passive mode, hence reducing the system to a well-understood scenario with a single
class of channels [10], [11]. Thus, here we focus on the heterogeneous case of $\omega^* < W_k(b^k_s)$, $k=1,2$, where the
steady-state belief value distribution exists for both classes under the Optimal Relaxed Policy.

D. Upper Bound on Achievable Throughput

The throughput performance of the Optimal Relaxed Policy provides a throughput upper bound for all policies
under the stringent constraint. The value of such an upper bound clearly depends on the number of users in each
class, $\gamma_k N$, $k = 1,2$, as well as the fraction $\alpha$ of users allowed for activation. Denoting $\gamma = [\gamma_1, \gamma_2]$, we represent
the time average expected throughput of the Optimal Relaxed Policy as $v^N(\gamma,\alpha)$. The following lemma states
that, as long as $\gamma$ and $\alpha$ are given, the per-user throughput is independent of $N$.

Lemma 2. Given $\gamma$ and $\alpha$, the per-user throughput $v^N(\gamma,\alpha)/N$ is independent of $N$, denoted henceforth as $r(\gamma,\alpha)$.

Proof: The proof follows from showing that, when the number of users $N$ grows, as long as the proportion
of each class of channels stays the same and the fraction $\alpha$ of users activated does not change, the form of
the Optimal Relaxed Policy does not change. Since each user is scheduled independently, the throughput $v^N(\gamma,\alpha)$
is proportional to $N$, establishing the lemma. Details are provided in Appendix C. ■

We hence refer to the $(\gamma,\alpha)$ pair as 'system parameters'. Thus, $N r(\gamma,\alpha)$ provides a throughput upper
bound to any policy in the same system under the stringent constraint (3). Equivalently, $r(\gamma, \alpha)$ provides a
per-user throughput performance upper bound to all policies that satisfy the stringent constraint.

We next describe Whittle's Index Policy for the strictly-constrained problem (2)-(3), and later study the
closeness of its performance to the upper bound established here.

IV. Whittle's Index Policy Description 

In this section we formally introduce Whittle's Index Policy for solving the stringently-constrained downlink
scheduling problem (2)-(3).

A. Whittle's Index Policy 

The Optimal Relaxed Policy, along with the Whittle's index values, gives a consistent ordering of belief values
with respect to the indices. For instance, under the Optimal Relaxed Policy, if it is optimal to schedule one
channel, it is then optimal to transmit to other channels with higher index values. So the Whittle's index value
gives an intuitive order of how attractive the channel is for scheduling. This intuition leads to Whittle's Index
Policy [14] under the stringent constraint on the maximum number of channels that can be scheduled.

Whittle's Index Policy: At the beginning of each time slot, channel $i$ in class $k$ is scheduled if its Whittle's
index value $W_k(\pi_i)$ is within the top $\alpha N$ index values of all channels in that slot, with arbitrary tie-breaking
while assuring a total of $\alpha N$ channels being scheduled.
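
The selection step of the policy amounts to a top-$\alpha N$ pick over the current index values. The sketch below assumes the index values have already been computed and are supplied as a list; the function name `whittle_index_schedule` and the lowest-user-number tie-breaking are choices of this example (the policy itself permits arbitrary tie-breaking).

```python
def whittle_index_schedule(index_values, num_to_schedule):
    """Select the users with the largest index values.

    index_values: list of current index values, one per user.
    num_to_schedule: number of users to activate (alpha * N in the text).
    Returns a 0/1 decision list with exactly num_to_schedule ones.
    Ties are broken in favor of the lower user number (an arbitrary choice).
    """
    # sorted() is stable, so equal index values keep their original order.
    order = sorted(range(len(index_values)),
                   key=lambda i: index_values[i], reverse=True)
    chosen = set(order[:num_to_schedule])
    return [1 if i in chosen else 0 for i in range(len(index_values))]


if __name__ == "__main__":
    # Hypothetical index values for four users, two of whom can be scheduled.
    print(whittle_index_schedule([0.1, 0.9, 0.5, 0.9], 2))
```

The per-slot cost is dominated by the sort, which is the sense in which the policy has low complexity once the index values are known.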

Whittle's Index Policy is attractive because it has very low complexity, and it was observed via numerical
investigations to yield significant throughput performance gains over scheduling strategies that do not utilize
channel memory [15]. The main focus of our work is to analytically understand the approximate or asymptotic
optimality of Whittle's Index Policy in asymmetric scenarios.

B. Whittle's Index Policy over Truncated State Space

Recall from Section II that the belief values evolve over a countable state space. Also note that if a channel
is not scheduled for a long time, its belief value will get arbitrarily close to its stationary belief value. This
motivates us to consider a truncated version of the belief value evolution whereby the belief value is set to its
steady state if the corresponding channel is not scheduled for a large number, say $\tau$, of slots. This mild assumption
facilitates more tractable performance analysis of the policy. Thus, if a class $k$ user is not scheduled for $\tau$ time
slots, its channel state history is entirely forgotten and its belief value will transit to the stationary belief value
$b^k_s$, where the truncation $\tau$ is assumed to be very large.

Whittle's Index Policy is then implemented over the truncated belief state space, which differs from the non-truncated
case merely in the truncated belief value evolution. We believe that the truncated scenario can provide an arbitrarily
close approximation to the original system when $\tau$ is large. More importantly, as we shall see in the following
two sections, Whittle's Index Policy, implemented over the truncated belief state space, achieves asymptotically
optimal performance as long as the truncation is sufficiently large.
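
One way to picture the truncated evolution is to track, for each channel, the pair (last observed state, age of the observation) and to collapse it to a single stationary state once the age would exceed $\tau$. The state encoding below, with tuples `(c, l)` and the string `"s"`, and the exact slot-counting convention are assumptions of this sketch rather than definitions taken from the paper.

```python
def next_belief_state(state, scheduled, observed_state, tau):
    """Advance the truncated belief state of one channel by one slot.

    state: either (c, l) with c in {0, 1} and 1 <= l <= tau, meaning the last
        observation c is l slots old, or "s" for the stationary belief state.
    scheduled: True if the channel is scheduled in this slot.
    observed_state: state revealed by feedback (used only when scheduled).
    tau: truncation length (assumed convention: ages run from 1 to tau).
    """
    if scheduled:
        # Fresh feedback resets the memory to age 1.
        return (observed_state, 1)
    if state == "s":
        # Without new feedback the stationary state is absorbing.
        return "s"
    last_state, age = state
    if age < tau:
        return (last_state, age + 1)
    # Memory older than tau slots is forgotten.
    return "s"


if __name__ == "__main__":
    state = (1, 1)
    for _ in range(4):
        state = next_belief_state(state, scheduled=False, observed_state=None, tau=3)
        print(state)
```

Under this encoding each class has $2\tau + 1$ belief states, so the state space is finite regardless of how long a channel stays idle.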

V. Local Optimality of Whittle's Index Policy

In this section, we study the optimality properties of Whittle's Index Policy for downlink scheduling, over a
large truncated belief space. This result forms the basis for the subsequent global optimality result in Section VI.
We start by introducing a state space over which the local optimality will be established.

A. System State Vector 

We define the system state as a vector that represents the proportion of channels in each belief value, over
the truncated space when the total number of users is $N$, i.e., $Z^N = [Z^{1,N}, Z^{2,N}]$, with

\[
Z^{k,N} = \big[ Z^{k,N}_{0,1}, \cdots, Z^{k,N}_{0,\tau},\ Z^{k,N}_{s},\ Z^{k,N}_{1,\tau}, \cdots, Z^{k,N}_{1,1} \big], \quad k = 1, 2,
\]

where $Z^{k,N}_{c,l}$ and $Z^{k,N}_{s}$ respectively denote the proportion of channels in the corresponding belief state $b^k_{c,l}$ and
$b^k_s$ with respect to the total number of users $N$. Hence, each element of $Z^N$ is a multiple of $1/N$, so that $Z^N$
takes values in a lattice with mesh size $1/N$. Noting that the total number of users in each class does not change
over time, for any $N$ the system state $Z^N[t] \in \mathcal{Z}$, where

\[
\mathcal{Z} := \Big\{ Z \geq 0 : Z^{k}_{s} + \sum_{c,l} Z^{k}_{c,l} = \gamma_k,\ k = 1, 2 \Big\}.
\tag{6}
\]

The system state vector does not distinguish users with the same belief state, thus its dimension does not
scale with $N$. Therefore, compared with $\pi[t]$, it provides a more convenient representation of the system belief
state. Furthermore, $Z^N[t]$ fully determines the instantaneous throughput gain in slot $t$ under both Whittle's Index
Policy and the Optimal Relaxed Policy (introduced in Proposition 1), because the instantaneous throughput gains
under both policies are only determined by the distribution of the channels over the different belief values, not their
identities.
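
To make the system state concrete, the sketch below builds the per-class block of the state vector from a list of per-user belief states. It reuses a hypothetical encoding in which a belief state is either a tuple `(c, l)` or the string `"s"` for the stationary state; the encoding, the function name, and the entry ordering (OFF ages ascending, stationary, ON ages descending) are assumptions of this example.

```python
from collections import Counter


def class_state_vector(user_states, total_users, tau):
    """Proportion of users (relative to all N users) in each belief state.

    user_states: belief states of the users of one class, each either
        (c, l) with c in {0, 1}, 1 <= l <= tau, or "s".
    total_users: total number of users N across both classes.
    tau: truncation length of the belief state space.
    """
    counts = Counter(user_states)
    order = ([(0, l) for l in range(1, tau + 1)]
             + ["s"]
             + [(1, l) for l in range(tau, 0, -1)])
    # Counter returns 0 for belief states that no user currently occupies.
    return [counts[key] / total_users for key in order]


if __name__ == "__main__":
    # Four users of one class out of N = 8 users in total, tau = 2.
    states = [(0, 1), (0, 1), "s", (1, 2)]
    print(class_state_vector(states, total_users=8, tau=2))
```

The entries of the block sum to the class proportion $\gamma_k$, and the vector length $2\tau + 1$ does not depend on $N$.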

From Lemma 1 and the subsequent remarks, under the operation of the Optimal Relaxed Policy, the belief
state evolution of each channel is positive recurrent with a steady-state distribution. The following lemma also
establishes the independence of this steady-state distribution from $N$, and defines a useful parameter for future
use.

Lemma 3. Given the system parameters $(\gamma, \alpha)$, the system state vector $Z^N[t]$ under the Optimal Relaxed Policy
converges in distribution to a random vector, denoted as $Z^N[\infty]$. The distribution of $Z^N[\infty]$ is independent of
$N$, with its mean denoted as

\[
\zeta^\alpha := E\big[ Z^N[\infty] \big].
\]

Proof: This lemma follows from a similar principle to the one we established in Lemma 2. For details, please
refer to Appendix D. ■

It is easy to see that $\zeta^\alpha \in \mathcal{Z}$ and that the form of $\zeta^\alpha$ fully determines the time average throughput of the Optimal
Relaxed Policy. Therefore, the vector $\zeta^\alpha$ provides an important benchmark for our asymptotic analysis. If, in the
long run under Whittle's Index Policy, the system state $Z^N[t]$ stays close to $\zeta^\alpha$, it indicates that Whittle's Index
Policy will have throughput performance close to that of the Optimal Relaxed Policy, i.e., the throughput upper
bound. To capture the closeness, we define the $\delta$ neighborhood of $\zeta^\alpha$ as

\[
\Omega_\delta(\zeta^\alpha) = \{ Z \in \mathcal{Z} : \|Z - \zeta^\alpha\| < \delta \},
\tag{7}
\]

for $\delta > 0$, where $\|\cdot\|$ stands for Euclidean distance. We are now ready to state and prove our first main result
regarding a form of local optimality of Whittle's Index Policy.



B. Local Optimality of Whittle's Index Policy

Under the system parameters $(\gamma, \alpha)$, we let $R_T^N(\gamma, \alpha, x)$ represent the time average throughput obtained over
the time duration $0 \leq t < T$ under Whittle's Index Policy, conditioned on the initial system state $Z^N[0] = x$, i.e.,

\[
R_T^N(\gamma, \alpha, x) := \frac{1}{T}\, E\Big[ \sum_{t=0}^{T-1} \sum_{i=1}^{N} \pi_i[t] \cdot a_i^{W}[t] \,\Big|\, Z^N[0] = x \Big],
\]

where $(a_i^{W}[t])_i$ denotes the scheduling decision vector made by Whittle's Index Policy at time $t$.

Recall from Lemma 2 that $r(\gamma,\alpha)$ denotes the per-user throughput under the Optimal Relaxed Policy, which
serves as an upper bound on Whittle's Index Policy performance. The next proposition characterizes the local
convergence property of Whittle's Index Policy performance to $r(\gamma,\alpha)$.

Proposition 2. Under the system parameters $(\gamma, \alpha)$, there exists a $\delta > 0$ neighborhood $\Omega_\delta(\zeta^\alpha)$ of $\zeta^\alpha$ such that,
if the initial system state $x$ is within $\Omega_\delta(\zeta^\alpha)$, then

\[
\lim_{T \to \infty}\ \lim_{m \to \infty}\ \frac{R_T^{N_m}(\gamma, \alpha, x)}{N_m} = r(\gamma, \alpha),
\]

where $\{N_m\}_m$ is any increasing sequence of positive integers with $\alpha N_m,\ \gamma_k N_m \in \mathbb{Z}^+$ for $k = 1,2$ and all $m$.

Proof Outline: Here, we give a high level description of the proof for an intuitive understanding, and refer the
reader to Appendix E for the rigorous derivation.

• We start by defining a fluid approximation, whereby the discrete-time evolution of $Z^N[t]$ under Whittle's
Index Policy is modeled as a deterministic vector $z[t] \in \mathcal{Z}$ that evolves in discrete time over $\mathcal{Z}$ and is independent
of $N$. Under this fluid approximation, the users are no longer unsplittable entities, so that the state space of $z[t]$ is
no longer restricted to a lattice as it was for $Z^N[t]$. Also, the fluid approximation $z[t]$ evolves in a deterministic
manner, in contrast to the stochastic transition of $Z^N[t]$. The evolution of $z[t]$ is defined by a difference equation
as a function of the expected state change of $Z^N[t]$ under Whittle's Index Policy as follows:

\[
z[t+1] - z[t]\,\Big|_{z[t]=z} = E\Big[ Z^N[t+1] - Z^N[t] \,\Big|\, Z^N[t] = z \Big],
\tag{8}
\]

where $N$ is any integer for which $z$ is a feasible state.

• We then establish local convergence of the fluid approximation model when $z[0]$ is within a small enough $\delta$
neighborhood $\Omega_\delta(\zeta^\alpha)$ of $\zeta^\alpha$. We show the convergence by first noting that the difference equation (8) is linear
within a wider convex region than $\Omega_\delta(\zeta^\alpha)$. Within this region, we obtain a closed form expression of the right
hand side of (8), which enables us to investigate the eigenvalue structure of the linear difference equation. We
show that each eigenvalue $\lambda$ satisfies $|\lambda| < 1$ and apply standard linear system theory to establish the local
convergence.

• We then connect the fluid approximation model $z[t]$ to the discrete-time stochastic system state $Z^N[t]$ by using a discrete-time extension of Kurtz's Theorem, which can be interpreted as an extension of the strong law of large numbers to random processes (see [21]). Essentially, it states that, over any finite time duration $[0,T]$, the actual system evolution $Z^N[t]$ can be made arbitrarily close to the above fluid approximation $z[t]$ by increasing the number of channels $N$ sufficiently, with exponential convergence rate.

• The previous convergence result, together with the local convergence result of the fluid evolution $z[t]$ to $\zeta_\gamma^\alpha$, enables us to establish the local convergence of the system state $Z^N[t]$ to $\zeta_\gamma^\alpha$ as the number of users grows, provided that the initial state $Z^N[0] \in \Omega_\delta(\zeta_\gamma^\alpha)$. Hence the system state under Whittle's Index Policy will stay close (in a probabilistic sense) to the expectation $\zeta_\gamma^\alpha$ of the system state under the Optimal Relaxed Policy, which, in turn, indicates that the throughput performance of Whittle's Index Policy will approach the throughput upper bound $r(\gamma,\alpha)$, as expressed in the proposition.

We again emphasize that the technical details of the outlined steps are fairly intricate and are moved to Appendix E. ■
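
The second step of the outline relies on a standard fact about linear difference equations: if $z[t+1] = A z[t] + c$ and every eigenvalue of $A$ has modulus below 1, the iteration converges to its unique fixed point. The following minimal Python sketch illustrates that fact only. The 2x2 matrix `A_HYP` and the fixed point `ZETA_HYP` are hypothetical illustration values, not quantities derived from the scheduling model.

```python
import cmath


def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix, from its characteristic polynomial."""
    (a11, a12), (a21, a22) = a
    trace = a11 + a22
    det = a11 * a22 - a12 * a21
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def iterate_affine(a, c, z0, steps):
    """Iterate z[t+1] = A z[t] + c for a 2-dimensional state."""
    z = list(z0)
    for _ in range(steps):
        z = [a[0][0] * z[0] + a[0][1] * z[1] + c[0],
             a[1][0] * z[0] + a[1][1] * z[1] + c[1]]
    return z


# Hypothetical linear dynamics (assumption for illustration only).
A_HYP = [[0.5, 0.2], [0.1, 0.6]]
ZETA_HYP = [0.4, 0.6]
# Choose c so that ZETA_HYP is the fixed point: zeta = A zeta + c.
C_HYP = [ZETA_HYP[i] - sum(A_HYP[i][j] * ZETA_HYP[j] for j in range(2))
         for i in range(2)]

if __name__ == "__main__":
    print("spectral radius:", max(abs(lam) for lam in eigenvalues_2x2(A_HYP)))
    print("state after 200 steps:", iterate_affine(A_HYP, C_HYP, [0.9, 0.1], 200))
```

In this toy instance the spectral radius is below 1, so any starting point is attracted to `ZETA_HYP`; in the paper the linear form only holds inside a convex region around the target state, which is why the resulting convergence is local.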

Proposition 2 illustrates an interesting local optimality property of Whittle's Index Policy as the number of users $N$ and the time horizon $T$ increase while the system parameters $(\gamma, \alpha)$ stay the same. It indicates that, under Whittle's Index Policy, as long as the initial state $Z^N[0]$ is close enough to $\zeta_\gamma^\alpha$, the average per-user throughput over any finite time duration will get arbitrarily close to the Optimal Relaxed Policy performance as the number of users scales.

Remark: We note that the sequence $\{N_m\}_m$ is used to guarantee that the number of channels in each class, as well as the number of scheduled users, take integer values. In fact, our result can be generalized to all $N$ by slightly perturbing $\gamma$ and $\alpha$ as a function of $N$ but assuring their limits are well-defined.



VI. Global Optimality of Whittle's Index Policy 

The above local optimality result heavily relies on the initial state $Z^N[0]$ being close to $\zeta_\gamma^\alpha$, which is difficult to guarantee. In this section, we study the global optimality of the infinite horizon throughput performance of Whittle's Index Policy starting from any initial state. We begin our analysis by presenting the recurrence structure of the system state.

Lemma 4. Under system parameters $(\gamma, \alpha)$, for any $\epsilon > 0$, if the number of users $N$ is large enough,

(i) The system state $Z^N[t]$ evolves as an aperiodic Markov chain, in a state space that contains only one recurrent class.

(ii) There exists at least one recurrent state within the $\epsilon$ neighborhood $\Omega_\epsilon(\zeta_\gamma^\alpha)$ of $\zeta_\gamma^\alpha$.

Proof: We prove this lemma by constructing probability paths from any state to the neighborhood $\Omega_\epsilon(\zeta_\gamma^\alpha)$. Details of the proof are included in Appendix F. ■

This lemma states that $Z^N[t]$ will ultimately enter any small neighborhood of $\zeta_\gamma^\alpha$ when $N$ is large enough. Together with Proposition 2, this result shows promise for establishing the global asymptotic optimality of Whittle's Index Policy. This is plausible because once $Z^N[t]$ enters $\Omega_\delta(\zeta_\gamma^\alpha)$, the performance of Whittle's Index Policy afterwards can get very close to its upper bound as $N$ scales, as established in Proposition 2. However, since we consider the infinite horizon time average throughput, this argument would break down if the time it takes for $Z^N[t]$ to enter $\Omega_\delta(\zeta_\gamma^\alpha)$ also scales up with $N$. This observation motivates us to introduce a useful assumption, which will later be justified (in Section VII-A) via numerical studies.

Assumption $\Psi$: For each $\epsilon>0$, let $\Gamma_x^N(\epsilon)$ represent the first time of reaching $\Omega_\epsilon(\zeta_\gamma^\alpha)$ starting from $Z^N[0] = x$, i.e.,

$$\Gamma_x^N(\epsilon) = \min\{t : Z^N[t] \in \Omega_\epsilon(\zeta_\gamma^\alpha) \mid Z^N[0] = x\}.$$

Then, we assume that the expected time of reaching $\Omega_\epsilon(\zeta_\gamma^\alpha)$ is bounded uniformly over $N$ and $x$, i.e., there exists $M_\epsilon<\infty$ such that $E[\Gamma_x^N(\epsilon)] < M_\epsilon$ for all $N$ and $x$.

Fig. 3: Transition behavior of $Z^N[t]$ in steady state.

Since for each $N$, $Z^N[t]$ under Whittle's Index Policy is recurrent and aperiodic with a finite state space, there exists a steady-state distribution associated with $Z^N[t]$. As before, we use $Z^N[\infty]$ to denote the associated limiting random vector. The next lemma establishes that, under Assumption $\Psi$, the distribution of $Z^N[\infty]$ approaches a point-mass at $\zeta_\gamma^\alpha$ as $N$ scales. Here, again, the sequence $\{N_m\}_m$ is defined in the same way as in Proposition 2.

Lemma 5. Under Assumption $\Psi$ and system parameters $(\gamma, \alpha)$, for any $\epsilon > 0$, the steady state probability of $Z^N[t]$ under Whittle's Index Policy satisfies

$$\lim_{m\to\infty} P\big(Z^{N_m}[\infty] \in \Omega_\epsilon(\zeta_\gamma^\alpha)\big) = 1.$$

Proof: The proof utilizes Theorem 6.89 from [21], which builds on the following arguments.

Note that $\epsilon > 0$ can be selected to be small enough for the following argument. As depicted in Fig. 3, we let $T_1^N$ be a random variable denoting, in steady state, the time duration between consecutive hitting times into the neighborhood $\Omega_\epsilon(\zeta_\gamma^\alpha)$ from outside of the neighborhood. Let $T_2^N$ denote the time duration from the time $Z^N[t]$ enters the neighborhood $\Omega_\epsilon(\zeta_\gamma^\alpha)$ from outside until the time it leaves. Hence, the expected proportion of time that $Z^N[t]$ stays outside this neighborhood is $(E[T_1^N] - E[T_2^N])/E[T_1^N]$.

We know that the numerator $E[T_1^N] - E[T_2^N]$ is uniformly bounded for all $N$ due to Assumption $\Psi$. However, as $N$ increases, it is more likely for $Z^N[t]$ to stay within the neighborhood for a long time before exiting it (based on the convergence of the fluid approximation model and Kurtz's Theorem in the proof of Proposition 2). Thus, $E[T_2^N]$, and hence the denominator $E[T_1^N]$, grow to infinity as $N$ scales. Therefore, the expected proportion of time spent outside $\Omega_\epsilon(\zeta_\gamma^\alpha)$ vanishes as $N$ scales up, which leads to the statement of the lemma. Details of the proof can be found in Appendix G. ■
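
The ratio argument in this proof can be illustrated numerically. The Python sketch below assumes a fixed bound on the expected time spent outside the neighborhood per cycle (the role played by the recurrence assumption) and lets the expected time spent inside grow, as it would when the number of users increases. All numbers (`M_HYP`, `INSIDE_HYP`) are hypothetical and chosen only to show the trend.

```python
def fraction_outside(expected_cycle, expected_inside):
    """Expected proportion of time spent outside the neighborhood.

    expected_cycle: expected time between consecutive entries into the
        neighborhood from outside.
    expected_inside: expected time from an entry until the next exit.
    """
    if expected_cycle <= 0 or not 0 <= expected_inside <= expected_cycle:
        raise ValueError("need 0 <= expected_inside <= expected_cycle, cycle > 0")
    return (expected_cycle - expected_inside) / expected_cycle


# Hypothetical uniform bound on the expected time outside per cycle.
M_HYP = 50.0
# Hypothetical expected sojourn times inside, growing with the system size.
INSIDE_HYP = [10.0, 100.0, 1000.0, 10000.0]

FRACTIONS = [fraction_outside(inside + M_HYP, inside) for inside in INSIDE_HYP]

if __name__ == "__main__":
    for inside, frac in zip(INSIDE_HYP, FRACTIONS):
        print(f"expected time inside = {inside:8.0f}, fraction outside = {frac:.4f}")
```

With a bounded numerator and a growing denominator the fraction shrinks toward zero, which is the mechanism behind the concentration of the steady-state distribution.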

Under Whittle's Index Policy with system parameters $(\gamma, \alpha)$, we let $R_x^N(\gamma, \alpha)$ be the achieved infinite horizon, time average throughput, conditioned on the initial system state $Z^N[0]=x$, i.e.,

$$R_x^N(\gamma,\alpha) := \lim_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \pi_i[t]\, a_i[t] \,\Big|\, Z^N[0] = x\Big].$$

From Lemma 5 we know that, in steady-state, the system state $Z^{N_m}[\infty]$ is increasingly concentrated around $\zeta_\gamma^\alpha$ as $m$ increases, regardless of the initial state $x$. We build on this to establish the global asymptotic optimality of Whittle's Index Policy.

Proposition 3. Under Assumption $\Psi$, for any initial system state $x$, we have

$$\lim_{m\to\infty} \frac{R_x^{N_m}(\gamma,\alpha)}{N_m} = r(\gamma,\alpha).$$

Since $r(\gamma, \alpha)$ is an upper bound on the maximum achievable per-user throughput by any policy, this implies that Whittle's Index Policy is optimal in the many user regime.

Proof: We prove this result by decomposing $R_x^{N_m}(\gamma, \alpha)$ as a summation of the expected throughput conditioned on whether the system state is within or outside an arbitrarily small $\epsilon$ neighborhood of $\zeta_\gamma^\alpha$. Since the latter has diminishing probability according to Lemma 5, the expected throughput of Whittle's Index Policy can get arbitrarily close to that of the Optimal Relaxed Policy. Details of the proof are provided in Appendix H. ■

Remarks:

1) We would like to emphasize that the global optimality result is not a straightforward extension of the local convergence result by contrasting Proposition 2 and Proposition 3. Note that in Proposition 2, the time limit is outside the limit of the number of users $N$, where each convergence (with $N$) is with respect to a fixed time duration. However, the order of limits is switched in the global optimality result of Proposition 3, as it states the convergence with $N$ of the infinite horizon average throughput, which is much stronger and hence is much more challenging to prove.

Fig. 4: Average time of hitting $\Omega_\epsilon(\zeta_\gamma^\alpha)$. (a) $Z^N[0] = x$; (b) $Z^N[0] = y$.

2) We would like to contrast Assumption $\Psi$ with Weber's condition [16]. For the general RMBP, Weber's condition leads to the same global asymptotic optimality result. While confirming Weber's condition may be possible in very low-dimensional problems, in our downlink scheduling problem, this requires one to rule out the existence of both closed orbits and chaotic behavior of a high-dimensional non-linear differential equation, which is extremely difficult to check, even numerically. Assumption $\Psi$, on the other hand, takes a much simpler form, as it is defined over the actual stochastic system and is amenable to easy numerical verification, as is performed in Section VII-A.

VII. Numerical Results 

A. Verification and Interpretation of Assumption $\Psi$

We start by numerically verifying Assumption $\Psi$. We consider the asymmetric scenario with two classes of channels with system parameters $\gamma=[0.45, 0.55]$, $\alpha=0.6$, with $p_1=0.9$, $r_1=0.45$, $p_2=0.8$, $r_2=0.3$.

We next examine the change of the average hitting time $\Gamma_x^N(\epsilon)$ with $N$, while maintaining $\alpha$ and $\gamma$.

We let $x, y \in \mathcal{Z}$ be initial values of $Z^N[0]$ that are selected to be two extreme points in the state space to exhibit the uniformity of $\Gamma_x^N(\epsilon)$ to the initial state. Specifically, state $x$ corresponds to the case when all the users have just observed their channels to be in OFF state, i.e., with belief value $b^k_{0,1}$, $k = 1,2$. And $y$ corresponds to the case when all users have no initial observation of their channel state history, i.e., with belief value $b^k_s$, $k = 1,2$.

We examine the average value of the hitting times $\Gamma_x^N(\epsilon)$ and $\Gamma_y^N(\epsilon)$ with a very small neighborhood $\epsilon=0.005$, when the number of users $N$ grows from 10 x 10^ to 500 x 10^. As indicated in Fig. 4, for both cases, the average time of hitting the $\epsilon$ neighborhood first decreases with $N$, and then converges and stays almost the same as $N$ scales up. This is especially intriguing. The rationale behind this phenomenon is as follows. Under Whittle's Index Policy, a total number of $\alpha N$ users are activated at each time slot. Therefore, for a relatively small number of users, the amount of probabilistic belief state transitions, as well as the amount of system states in the neighborhood, increases with $N$, leading to a higher chance of hitting the desired neighborhood and a smaller value of hitting time. However, the belief update of each user contributes $1/N$ to the change of the system state, which decreases with $N$. Therefore, as $N$ further increases, the total amount of transitions of the system state due to channel state feedback is roughly $\alpha N \cdot 1/N = \alpha$, which is invariant of $N$. This result, along with many other numerical experiments we have conducted that lead to the same observation, gives verification to Assumption $\Psi$.

B. 'Exploitation versus Exploration' Trade-off 

In this section, we demonstrate how the Whittle's index value captures the 'exploitation versus exploration' 
trade-off for our asymmetric downlink scheduling problem. 

Consider two classes of ON/OFF fading channels with belief value evolutions plotted in Fig. 5(a). Note that both classes have the same stationary distribution $b_s^k = 0.5$, $k \in \{1, 2\}$ of being at ON state, but channels in class 1 have a higher degree of time correlation, i.e., fade slower, than channels in class 2 since $p_1 > p_2$ and $r_1 < r_2$. The corresponding Whittle index values of the two classes of channels are depicted in Fig. 5(b) as functions of the updated belief value starting from different initial states.

Fig. 5: The evolution of belief value and Whittle's index value. (a) Belief value evolution; (b) Whittle's index value evolution.

To understand the nature of Whittle's index value, we first consider the case when the channels in both classes are observed to be ON at time 0 and stay passive since then. As indicated in Fig. 5(a), the class 1 channel has a higher belief value than the class 2 channel, hence scheduling the class 1 channel gives a higher immediate throughput than scheduling the class 2 channel. Moreover, once a class 1 channel is scheduled, it is more likely to stay in ON state again, bringing high future gains. Accordingly, the index values in Fig. 5(b) when both state evolutions start from ON states capture that it is more attractive to schedule the class 1 channel because of the advantage in both exploitation and exploration.

On the other hand, when the scheduler has observed channels in both classes to be OFF at time 0, Fig. 5(a) shows that the class 2 channel has a higher belief value than the class 1 channel. However, although the Whittle's index value in Fig. 5(b) of the class 2 channel is initially larger than that of the class 1 channel, after a certain amount of delay (around 8 slots in the figure) this order is switched, which is interpreted as follows: initially, since the class 1 channel has a smaller belief value than that of the class 2 channel, it is more attractive to exploit the immediate gain brought by the class 2 channel. However, as the passive time grows, as indicated in Fig. 5(a), the difference between the immediate gains of both classes diminishes. Then, it becomes more attractive to explore the class 1 channel because its longer memory can bring higher future gains if it turns out to be in ON state.
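
The belief behavior described for the two classes can be reproduced with a short Python sketch. It assumes the standard belief update for a two-state Markov channel that stays passive, namely that the next belief equals `belief * p + (1 - belief) * r`, where `p` is the ON-to-ON and `r` the OFF-to-ON transition probability; this is assumed to match the belief evolution rule used in the paper. The parameter pairs `CLASS_1` and `CLASS_2` are hypothetical values chosen so that both classes have stationary ON probability 0.5 with $p_1 > p_2$ and $r_1 < r_2$; they are not the values behind Fig. 5.

```python
def idle_update(belief, p, r):
    """Belief that the channel is ON in the next slot when it is not observed."""
    return belief * p + (1.0 - belief) * r


def stationary_belief(p, r):
    """Stationary probability of the ON state for transition probabilities p, r."""
    return r / (1.0 - p + r)


def idle_trajectory(first_belief, p, r, slots):
    """Beliefs over `slots` consecutive passive slots, starting at first_belief."""
    beliefs = [first_belief]
    for _ in range(slots - 1):
        beliefs.append(idle_update(beliefs[-1], p, r))
    return beliefs


# Hypothetical (p, r) pairs: class 1 fades slower than class 2.
CLASS_1 = (0.8, 0.2)
CLASS_2 = (0.6, 0.4)

# One slot after an ON observation the belief is p; after an OFF observation it is r.
ON_1 = idle_trajectory(CLASS_1[0], *CLASS_1, slots=10)
ON_2 = idle_trajectory(CLASS_2[0], *CLASS_2, slots=10)
OFF_1 = idle_trajectory(CLASS_1[1], *CLASS_1, slots=10)
OFF_2 = idle_trajectory(CLASS_2[1], *CLASS_2, slots=10)

if __name__ == "__main__":
    for h in range(10):
        print(h + 1, round(ON_1[h], 4), round(ON_2[h], 4),
              round(OFF_1[h], 4), round(OFF_2[h], 4))
```

Starting from an ON observation the slower-fading class keeps the higher belief, while starting from an OFF observation the faster-fading class has the higher belief and the gap shrinks as the passive time grows. The index crossover discussed above additionally depends on the index formula, which this sketch does not compute.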

This investigation reveals the intricate nature of Whittle's index value in capturing the fundamental 'exploration 
versus exploitation' trade-off. In our scheduling problem with asymmetric channel statistics, such a property of 
Whittle's Index Policy turns out to be crucial in achieving asymptotically optimal performance. 

VIII. Conclusion 

In this paper, we studied the problem of downlink scheduling over ON/OFF Markovian fading channels in the 
presence of channel heterogeneity. We consider the scenario where instantaneous channel state information is 
not perfectly known at the scheduler, but is acquired via a practical ARQ-styled feedback after each scheduled 
transmission. We analytically characterized the performance of Whittle's Index Policy for downlink scheduling, 
and proved its local and global asymptotic optimality properties as the number of users scales. Specifically, 
provided that the initial system state is within a certain region, we established the local optimality of Whittle's 
Index Policy by investigating the evolution of the system belief state with a fluid approximation. We then 
established the global asymptotic optimality of Whittle's Index Policy under a recurrence condition, which is 
suitable for numerical verification. Our results establish that Whittle's Index Policy, which is attractive due
to its low-complexity operation, also possesses strong asymptotic optimality properties for scheduling over
heterogeneous Markovian fading channels.

Appendix A
Proof of Proposition 1

A. Lagrangian decomposition - Thresholdability 

The relaxed activation constraint can be written in an equivalent form that requires at least $(1 - \alpha)N$ channels to be passive on average, i.e.,

$$\liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \big(1 - a_i^u[t]\big)\Big] \ge (1-\alpha)N. \tag{9}$$

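Associating a Lagrange multiplier $\omega$ to the above constraint, we consider the following Lagrangian function $L(u,\omega)$ of the relaxed problem,

$$L(u,\omega) = \liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \pi_i[t]\, a_i^u[t] \,\Big|\, \pi[0]\Big] + \omega \cdot \liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \big(1 - a_i^u[t]\big) \,\Big|\, \pi[0]\Big] - \omega(1-\alpha)N. \tag{10}$$

The dual function $D(\omega)$ is defined as $D(\omega) = \max_{u\in U} L(u,\omega)$. The following lemma provides a useful upper bound to $D(\omega)$.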



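Lemma 6.

$$D(\omega) \le \max_{u\in U} \sum_{i=1}^{N} \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi[0]\Big] - \omega(1-\alpha)N. \tag{11}$$

Proof:

$$\begin{aligned}
D(\omega) &\le \max_{u\in U} \liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi[0]\Big] - \omega(1-\alpha)N \\
&\le \max_{u\in U} \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1}\sum_{i=1}^{N} \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi[0]\Big] - \omega(1-\alpha)N \\
&\le \max_{u\in U} \sum_{i=1}^{N} \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi[0]\Big] - \omega(1-\alpha)N,
\end{aligned}$$

where the first and the last inequalities follow from the superadditivity of limit inferior and the subadditivity of limit superior, respectively. ■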

Consider the unconstrained optimization problem in the upper bound (11). It can be viewed as a composition of independent $\omega$-subsidy problems interpreted as follows. For each channel $i$ at belief state $\pi_i$, it will receive a reward $\pi_i$ when it activates, otherwise it will receive a subsidy $\omega$ for passivity. Here, for each channel, its reward only depends on its own transmissions and is independent of decisions for other channels. Hence, the optimization problem in (11) can be decomposed into $N$ $\omega$-subsidy problems. For channel $i$, the $\omega$-subsidy problem is expressed as follows,

$$V_i(\omega) = \max_{u\in U_i} \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi_i[0]\Big], \tag{12}$$

where $U_i$ denotes the set of scheduling decisions that activate and idle the channel according to the observed channel history. An important property for each $\omega$-subsidy problem is thresholdability, given in the following lemma.



Lemma 7. The optimal policy for the $\omega$-subsidy problem (12) is a threshold-based policy. Specifically, for each channel $i$ in class $k$, there exists a threshold value $\theta_k(\omega) \in [0, 1]$ such that it is optimal to transmit when its current belief value $\pi_i[t] > \theta_k(\omega)$, and to stay passive when $\pi_i[t] < \theta_k(\omega)$, with tie breaking arbitrarily at $\pi_i[t] = \theta_k(\omega)$.

Proof: The thresholdability has been proved in [14] assuming a different form of objective function than (12). In fact, thresholdability holds for the general optimization problem of (12) as well, as explained in detail below.



It was shown in [14] that a stationary threshold-based policy $u^*_\beta$ with threshold value $\theta_k(\beta,\omega)$ is optimal for the $\beta$-discounted $\omega$-subsidy problem

$$\max_{u\in U_i} E\Big[\sum_{t=0}^{\infty} \beta^t \big(\pi_i[t]\, a_i^u[t] + \omega (1 - a_i^u[t])\big) \,\Big|\, \pi_i[0]\Big] \tag{13}$$

for channels in class $k$, where $\beta \in (0, 1)$ is the discount factor. The optimal policy $u^*_\beta$ for (13) activates the channels with belief values greater than $\theta_k(\beta,\omega)$ and idles the channels whose belief values are smaller than $\theta_k(\beta,\omega)$, with tie breaking arbitrarily at $\theta_k(\beta,\omega)$.

From Dutta's paper [22], we know that if a value boundedness condition holds for the discounted problem (13), and if a family of optimal policies $\{u^*_\beta\}$ converges to a certain limit $\phi$ as $\beta \to 1$, then $\phi$ is optimal for the $\omega$-subsidy average reward problem (12) that is defined with respect to limit superior.

Indeed, it was shown in [14] that as $\beta \to 1$, $\theta_k(\omega) = \lim_{\beta\to 1} \theta_k(\beta, \omega)$ exists and the value boundedness condition holds for the discounted problem (13). Therefore the threshold-based policy is optimal for the problem (12). ■



In the $\omega$-subsidy problem, we let $\theta(\omega) = \{\theta_k(\omega), k = 1,2\}$ denote the optimal threshold-based policy for the system. Because of the simple form of the threshold-based policy, we have the following lemma.

Lemma 8. Given a Lagrange multiplier $\omega \ge 0$, the threshold-based policy $\theta(\omega)$ achieves the maximum value of the Lagrange function $L(u,\omega)$, i.e.,

$$D(\omega) = L(\theta(\omega),\omega). \tag{14}$$

Proof: Case (1). Suppose for channels in class $k$, $\theta_k(\omega) > b_s^k$. Due to the monotonicity behavior of belief evolution, channels with initial belief $\pi_i[0] < \theta_k(\omega)$ always stay idle. Channels with initial belief $\pi_i[0] > \theta_k(\omega)$ will be activated until the channel turns out to be OFF and then remain idle henceforth. Therefore with probability 1, all channels will stay in the idle mode (see proof of Lemma 1 for detailed description). Therefore, the following limit inferior will coincide with limit superior and can be calculated.

$$\liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \pi_i[t]\, a_i^{\theta(\omega)}[t] \,\Big|\, \pi_i[0]\Big] = \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \pi_i[t]\, a_i^{\theta(\omega)}[t] \,\Big|\, \pi_i[0]\Big] = 0, \tag{15}$$

$$\liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(1 - a_i^{\theta(\omega)}[t]\big) \,\Big|\, \pi_i[0]\Big] = \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(1 - a_i^{\theta(\omega)}[t]\big) \,\Big|\, \pi_i[0]\Big] = 1. \tag{16}$$

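Case (2). Suppose for channels in class $k$, $\theta_k(\omega) \le b_s^k$. From the belief value evolution in Fig. 2, the belief values of each channel evolve as a positive recurrent Markov chain (again, see proof of Lemma 1 for detailed description). Therefore, the limit inferior and limit superior coincide,

$$\liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \pi_i[t]\, a_i^{\theta(\omega)}[t] \,\Big|\, \pi_i[0]\Big] = \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \pi_i[t]\, a_i^{\theta(\omega)}[t] \,\Big|\, \pi_i[0]\Big], \tag{17}$$

$$\liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(1 - a_i^{\theta(\omega)}[t]\big) \,\Big|\, \pi_i[0]\Big] = \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(1 - a_i^{\theta(\omega)}[t]\big) \,\Big|\, \pi_i[0]\Big]. \tag{18}$$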


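From equations (15)-(18), as well as equation (10), we have,

$$\begin{aligned}
L(\theta(\omega),\omega) &= \sum_{i=1}^{N} \liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \pi_i[t]\, a_i^{\theta(\omega)}[t] \,\Big|\, \pi_i[0]\Big] + \omega \sum_{i=1}^{N} \liminf_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(1 - a_i^{\theta(\omega)}[t]\big) \,\Big|\, \pi_i[0]\Big] - \omega(1-\alpha)N \\
&= \sum_{i=1}^{N} \limsup_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=0}^{T-1} \big(\pi_i[t]\, a_i^{\theta(\omega)}[t] + \omega (1 - a_i^{\theta(\omega)}[t])\big) \,\Big|\, \pi_i[0]\Big] - \omega(1-\alpha)N \\
&= \sum_{i=1}^{N} V_i(\omega) - \omega(1-\alpha)N \\
&\ge D(\omega),
\end{aligned}$$

where the last inequality follows from Lemma 6. Because we also know $L(\theta(\omega),\omega) \le D(\omega)$ since $D(\omega) = \max_{u\in U} L(u,\omega)$, we have $D(\omega) = L(\theta(\omega),\omega)$. ■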

B. The $\omega$-subsidy problem: Indexability

For each channel in class $k$, let $\mathcal{I}_k(\omega) \subseteq \mathcal{B}_k$ be the set of belief states for which, under the threshold-based policy $\theta(\omega)$, it is optimal to stay idle. From the thresholdability property, it is clear that $\mathcal{I}_k(\omega)$ includes all the belief values in $\mathcal{B}_k$ no greater than $\theta_k(\omega)$.

For class-$k$ channels, we let $a_k^\omega(\pi)$ denote the optimal decision at belief value $\pi \in \mathcal{B}_k$ under subsidy $\omega$. Following the definition in [13], the Whittle's index value $W_k(\pi)$, $\pi \in \mathcal{B}_k$, is given by the infimum value of subsidy $\omega$ for which it is equally optimal to activate or idle at belief $\pi$, i.e.,

$$W_k(\pi) = \inf\{\omega : a_k^\omega(\pi) = \{0,1\}\}. \tag{19}$$

The Whittle's Indexability condition, specific to the scheduling problem, is defined as follows.

Whittle's Indexability condition: The downlink scheduling problem is Whittle indexable if, as $\omega$ increases from $-\infty$ to $\infty$, the set $\mathcal{I}_k(\omega)$ monotonically increases from $\emptyset$ to $\mathcal{B}_k$.

It was proved in [14] that the idle set $\mathcal{I}_k(\omega)$ indeed monotonically increases from $\emptyset$ to $\mathcal{B}_k$ as $\omega$ increases from $-\infty$ to $\infty$. Therefore, the downlink scheduling problem is Whittle indexable, as recorded below.






Indexability Theorem. The downlink scheduling problem is Whittle indexable.

It can be observed from the Indexability condition that the index value $W_k(\pi) \in [0, 1]$ and that $W_k(\pi)$ monotonically increases with $\pi \in \mathcal{B}_k$. The next lemma gives the closed form expression of the index values.

Lemma 9. The closed form expression of Whittle's index values is given as follows,

Wfc(7r) = { ^~P>'+iK-<i+i)^+^li+i ~ ^ (20) 

Proof: The derivation of the Whittle's index value is included in Appendix K. We remark that the index expression $W_k(\pi)$ is constant when $b_s^k \le \pi \le p_k$, which differs from the indices derived in [14]. Such a difference is due to the definition (19) of the index value and is explained in detail in Appendix K. ■

With the definition of index value and the established indexability condition, the optimal threshold-based policy can be implemented in a more efficient manner, characterized in Lemma 10. Here, instead of maintaining different threshold values $\theta_k(\omega)$ for each $\omega$, the scheduler simply compares the index value with $\omega$ to decide whether to transmit on the channel.

Lemma 10. Under the $\omega$-subsidized problem, at each time slot, for the $i^{th}$ channel in class $k$, it is optimal to transmit when $W_k(\pi_i) > \omega$, and to stay idle when $W_k(\pi_i) < \omega$, with tie breaking arbitrarily if $W_k(\pi_i) = \omega$.

Proof: If $W_k(\pi_i) > \omega$, from definition (19) of the index value, $W_k(\pi_i)$ is the minimum subsidy required for the belief value $\pi_i$ to be within the idle set. Since $W_k(\pi_i)$ is higher than the actual subsidy $\omega$, it is optimal to activate the channel at subsidy $\omega$.

If $W_k(\pi_i) < \omega$, similarly we know the subsidy $\omega$ is higher than the minimum subsidy value $W_k(\pi_i)$ such that it is optimal to stay idle at $\pi_i$. Hence, from Indexability, it is optimal to stay idle at $\pi_i$ at subsidy $\omega$.

If $W_k(\pi_i) = \omega$, from the above definition (19) of the index value, it is equally optimal to activate or idle at $\pi_i$. ■

C. Optimal policy for the relaxed problem 

We have thus far seen from Lemma 10 that the dual function $D(\omega)$ can be achieved by a threshold-based policy implemented over the index values. We now proceed to identify the optimal policy for the original relaxed problem.

Let $\phi(\omega, \rho)$ denote the policy where the channels with index value greater than $\omega$ activate, the channels with index value smaller than $\omega$ remain idle, and the channels with index value $\omega$ activate with probability $\rho$.
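
The per-slot rule of this randomized index policy is simple enough to sketch in Python. The function name `phi_decision` and its signature are hypothetical, introduced only for illustration; the index value is taken as an input because its closed form is not computed here, and ties are detected by exact equality, which is an assumption that holds when index values come from a finite set of belief states.

```python
import random


def phi_decision(index_value, omega, rho, rng):
    """Return 1 (activate) or 0 (idle) for one channel in one slot.

    index_value: Whittle index of the channel's current belief (given).
    omega: subsidy level.
    rho: activation probability used only when index_value == omega.
    rng: a random.Random instance used for the tie-breaking draw.
    """
    if index_value > omega:
        return 1
    if index_value < omega:
        return 0
    return 1 if rng.random() < rho else 0


if __name__ == "__main__":
    rng = random.Random(0)
    print(phi_decision(0.7, 0.5, 0.3, rng))  # index above the subsidy
    print(phi_decision(0.2, 0.5, 0.3, rng))  # index below the subsidy
    ties = [phi_decision(0.5, 0.5, 0.3, rng) for _ in range(10000)]
    print(sum(ties) / len(ties))  # empirical activation frequency at a tie
```

The randomization at the tie is what lets the long-run activation fraction be tuned continuously, which is the degree of freedom used to meet the average activation constraint exactly.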



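Lemma 11. Given $\alpha$, there exists a unique pair $(\omega^*, \rho^*)$ such that, under policy $\phi(\omega^*, \rho^*)$,

$$\lim_{T\to\infty} \frac{1}{T} E\Big[\sum_{t=1}^{T}\sum_{i=1}^{N} a_i^{\phi(\omega^*,\rho^*)}[t]\Big] = \alpha N. \tag{21}$$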

Proof: For a single channel $i$ in class $k$, consider the policy where the channel activates if its belief value $\pi_i > b^k$, stays idle when $\pi_i < b^k$, and activates with probability $\rho$ when $\pi_i = b^k$, for some belief value $b^k$. From the belief value evolution we can calculate the expected time of activation, denoted by $A^k(b^k,\rho)$,

( 1 (i-p.)(fe-p) r , fc ^ , fc 

A^ib^ p) = { pb'^a,,M^-pK,^+i+(^-P>')i.h+i-p) " ''o./i' ^22) 
\o if^>&^ 

It is clear from its expression that, given $b^k$, $A^k(b^k,\rho)$ is continuous in $\rho$. Also we have $A^k(b^k_{0,h},0) = A^k(b^k_{0,h+1},1)$. In addition, some simple algebra reveals that, given $b^k_{0,h}$, $A^k(b^k_{0,h},\rho)$ strictly increases with $\rho$.

Therefore, since $A^k(b^k_{0,h},0) = A^k(b^k_{0,h+1},1)$, given $\rho$, $A^k(b^k,\rho)$ monotonically decreases with $b^k \in \mathcal{B}_k$.

Also, one can observe from (22) that, given $\rho$, $\lim_{h\to\infty} A^k(b^k_{0,h},\rho) = 0$ and $A^k(b^k_{0,1},1) = 1$. Hence by appropriately choosing $b^k$ and $\rho$, $A^k(b^k,\rho)$ can achieve any value within $[0, 1]$.

Recall from the definition of indexability that the index value $W_k(b^k)$ monotonically increases with $b^k \in \mathcal{B}_k$, $k = 1, 2$. It follows from the above analysis that, as $\omega$ increases, under policy $\phi(\omega, 1)$, the fraction of activation time for each user strictly decreases from 1 to 0. Therefore, there exists a unique $(\omega^*,\rho^*)$ pair, such that the policy $\phi(\omega^*,\rho^*)$ strictly satisfies the activation constraint. ■

We now consider the relaxed optimization problem and the policy $\phi(\omega^*, \rho^*)$ as specified in the previous lemma. Clearly, the policy $\phi(\omega^*,\rho^*)$ is primal feasible and the Lagrange multiplier $\omega^*$ is dual feasible. From Lemma 10, $\phi(\omega^*,\rho^*)$ is optimal for the $\omega^*$-subsidy problem and hence $D(\omega^*) = L(\phi(\omega^*, \rho^*),\omega^*)$. Furthermore, according to (21), $\phi(\omega^*,\rho^*)$ activates $\alpha N$ users on average, and thus the complementary slackness condition holds for the primal-dual pair $(\phi(\omega^*,\rho^*),\omega^*)$. From the optimality condition for primal-dual optimal solutions [23], $(\phi(\omega^*, \rho^*), \omega^*)$ is an optimal primal-dual pair. Therefore $\omega^* \in \arg\min_\omega D(\omega)$ and $\phi(\omega^*,\rho^*)$ is the optimal solution to the relaxed problem. Letting $\phi^*$ represent $\phi(\omega^*,\rho^*)$, we thus have proved Proposition 1.

Appendix B
Proof of Lemma 1

(i) First consider the scenario where $\omega^* \le W_k(b_s^k)$ and suppose $\omega^* = W_k(b^k_{0,h})$ for the belief state $b^k_{0,h}$. If the belief value of a channel is above $b^k_{0,h}$ at the beginning of a slot, the channel will be activated. According to the belief value evolution rule (1), in the next slot its belief value will either be $p_k$ or $r_k$, depending on the underlying channel state revealed at the end of a slot. Clearly, the belief evolution in this case is positive recurrent within a finite state space, i.e., the belief state can only take the values $p_k, r_k, b^k_{0,2}, \ldots, b^k_{0,h+1}$. On the other hand, if the belief value is below $b^k_{0,h}$, the channel remains idle and will activate once its belief value exceeds $b^k_{0,h}$. Fig. 6 illustrates the belief evolution in steady state under this scenario.

(ii) Consider the scenario where $\omega^* > W_k(b_s^k)$. In this case, a channel is activated if its index value is above $\omega^*$. After transmission, if the channel is observed to be in OFF state, its belief value will transit to $r_k$ and the channel stays idle until its index value crosses $\omega^*$. Since $\omega^* > W_k(b_s^k)$, it is clear from the belief value evolution (see Fig. 2) that, starting from $r_k$, the belief value will always be smaller than $b_s^k$. Hence the channel will stay idle at all times. On the other hand, if the channel is observed to be in ON state after transmission, the belief value will transit to $p_k$ and the channel will keep on transmitting until the underlying channel turns out to be in OFF state. Since we assumed $p_k < 1$, the channel will ultimately be in OFF state, its belief value will transit to $r_k$, and it will stay in idle mode ever since. Therefore eventually no channel in class $k$ will be scheduled and the belief values will keep moving toward, but never reach, the steady state belief value $b_s^k$.

Appendix C
Proof of Lemma 2

Consider two systems with different total numbers of users but identical $\alpha$ and $\gamma$. Suppose the first system has $N_1$ users in total while the second system has $N_2$ users. For the first system with $N_1$ users, suppose the policy $\phi^*$ specified in Proposition 1 is optimal for the relaxed-constraint problem. Therefore, from the proof of Proposition 1, $\phi^*$ is optimal for each individual $\omega^*$-subsidized problem (12). For each channel in class $k$, we let $A^k_{\phi^*}$ denote the expected fraction of time of activation, which is expressed specifically in equation (22). Then, according to Proposition 1(ii), the expected number of activated users satisfies

$$\gamma_1 N_1 \cdot A^1_{\phi^*} + \gamma_2 N_1 \cdot A^2_{\phi^*} = \alpha N_1. \tag{23}$$

Now apply the same policy $\phi^*$ when the total number of users is $N_2$. Since $\phi^*$ schedules each channel independently, $A^1_{\phi^*}$ and $A^2_{\phi^*}$ do not change in this scenario. Therefore, the expected number of activated users is expressed as

$$\gamma_1 N_2 \cdot A^1_{\phi^*} + \gamma_2 N_2 \cdot A^2_{\phi^*} = \frac{N_2}{N_1}\big[\gamma_1 N_1 \cdot A^1_{\phi^*} + \gamma_2 N_1 \cdot A^2_{\phi^*}\big] = \alpha N_2, \tag{24}$$

hence the complementary slackness condition for the relaxed-constraint problem is also satisfied under $\phi^*$ and $\omega^*$ when the total number of users is $N_2$. In this case, since $\phi^*$ is still optimal for each individual $\omega^*$-subsidized problem and both $\phi^*$ and $\omega^*$ are feasible, $(\phi^*, \omega^*)$ is a primal-dual optimal pair when the total number of users is $N_2$, by the same argument as in the proof of Proposition 1.

Therefore, fixing system parameters $(\gamma, \alpha)$, for different numbers $N$ of users, the policy $\phi^*$ is always optimal. Since the policy $\phi^*$ schedules each channel independently, we let $V_k(\gamma, \alpha)$ denote the expected reward contributed by each channel in class $k$. Hence we have

$$\Upsilon^N(\gamma,\alpha) = N\gamma_1 V_1(\gamma,\alpha) + N\gamma_2 V_2(\gamma,\alpha).$$

Therefore the per-user throughput is

$$\frac{\Upsilon^N(\gamma,\alpha)}{N} = \gamma_1 V_1(\gamma,\alpha) + \gamma_2 V_2(\gamma,\alpha),$$

which is independent of $N$. Hence the lemma is proven.
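
The scaling step of this proof can be checked with a few lines of Python. The sketch assumes each class has a fixed per-channel activation fraction that does not depend on the number of users, as argued above. The class proportions reuse the values 0.45 and 0.55 from the numerical section, while the activation fractions in `A_HYP` are hypothetical and chosen only so that the weighted sum equals 0.6.

```python
def expected_active(n_users, gamma, act_fraction):
    """Expected number of activated users when class k holds gamma[k] * n_users
    channels and each of them is active a fraction act_fraction[k] of the time."""
    return sum(g * n_users * a for g, a in zip(gamma, act_fraction))


GAMMA = [0.45, 0.55]
# Hypothetical per-class activation fractions (not computed from the model).
A_HYP = [0.8, 0.24 / 0.55]

if __name__ == "__main__":
    for n in (100, 2000, 100000):
        print(n, expected_active(n, GAMMA, A_HYP) / n)
```

Because the activation fractions are properties of a single channel under the policy, the activated share per user is the same for every system size, so a policy that meets the activation constraint for one size meets it for all sizes with the same class proportions.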



Appendix D
Proof of Lemma 3

Given system parameters $(\gamma, \alpha)$, we know from the proof of Lemma 2 that the form of the Optimal Relaxed Policy, denoted by $\phi^*$, does not change with the number $N$ of users. Since $\phi^*$ schedules each channel independently, we let the vector $\varepsilon^k = [\varepsilon^k_{0,1}, \ldots, \varepsilon^k_{0,\tau}, \varepsilon^k_{1,1}, \ldots, \varepsilon^k_{1,\tau}]$ denote the steady state distribution of the belief value of users in class $k$ under $\phi^*$, with $\sum_h \varepsilon^k_{0,h} + \sum_h \varepsilon^k_{1,h} = 1$. Therefore,

$$E[Z^N(\infty)] = \frac{1}{N}\big[\gamma_1 N \varepsilon^1, \gamma_2 N \varepsilon^2\big] = \big[\gamma_1 \varepsilon^1, \gamma_2 \varepsilon^2\big].$$

Since $\phi^*$ is independent of $N$, $\varepsilon^k$ is independent of $N$ for $k = 1, 2$. Therefore $E[Z^N(\infty)]$ is independent of the user number $N$, which proves the lemma.



Appendix E
Proof of Proposition 2

A. Notations

We shall denote the $i^{th}$ element of $Z^N[t]$ as $Z_i^N[t]$, and let $\beta_i$ denote the corresponding belief value. The index value corresponding to $\beta_i$ is denoted as $w_i$. In this proof, since we are fixing the system parameters $(\gamma, \alpha)$, we shall drop the suffixes $\alpha$ and $\gamma$ to denote $\zeta_\gamma^\alpha$ as $\zeta$.

For ease of exposition, in this proof we assume $W_2(b^2_{0,h_2-1}) < W_1(b^1_{0,h_1}) = \omega^* < W_2(b^2_{0,h_2})$. Hence, in the optimal relaxed problem, channels in class 1 are activated when their belief values are above $b^1_{0,h_1}$, stay idle if their belief values are below $b^1_{0,h_1}$, and activate with probability $\rho^* \in (0, 1)$ at $b^1_{0,h_1}$. Channels in class 2 are activated when their belief values are no smaller than $b^2_{0,h_2}$, and stay idle otherwise.

B. Transition properties of the system state

We first investigate the belief transition structure of the system state under Whittle's Index Policy. It is clear that $Z^N[t]$ evolves as a Markov chain. We define the expected drift $\nabla Z^N[t]$ associated with the transition of $Z^N[t]$ as follows,

$$\nabla Z^N[t] = E\big[Z^N[t+1] - Z^N[t] \,\big|\, Z^N[t]\big]. \tag{25}$$

For a channel with belief value /3j, we let ■ and q] ^ be the probability that its belief state changes to state 
(ij under the idle and transmission actions, respectively. For example, if /3j corresponds to belief value 6q;, then 
<iii-\-i = 1 if the channel stays idle, otherwise q}-^ = 1 — 6q ^ and (?j^2t+i ~ ^o«' which corresponds to the 
probability of observed channel being or 1, respectively. Under the Whittle's Index Policy, we let gi{z) be the 
fraction of users in belief value pi that are activated, which is expressed as. 
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^'%7-'^'^'' ] ,l}, if^i/O, 
9i{z) = \i^ " if = and a- E»,>^. > 0, (^6) 

if Zi = and a-Y^^^^^^Zj < 0. 

where [•] = max{0, •}. We use qij{z) to denote the probability that the belief value of a channel transit from /3i 
to /3j under system state z. Then qij{z) is expressed as 

qij{z) = gi{z)q}j + (l - gi{z))q'lj. (27) 
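
The activation fractions in (26) and the mixing rule (27) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the example state and the index values are hypothetical, the belief states are assumed to have distinct index values, and the convention for empty states ($z_i = 0$) follows the reading of (26) given above.

```python
def activation_fractions(z, w, alpha):
    """Fraction g_i of the users in each belief state that is activated.

    z[i]  : fraction of users in belief state i (hypothetical example input)
    w[i]  : index value of belief state i (assumed distinct across states)
    alpha : fraction of users that may be activated in a slot
    """
    g = []
    for zi, wi in zip(z, w):
        # Activation budget left after serving all states with a higher index.
        left = alpha - sum(zj for zj, wj in zip(z, w) if wj > wi)
        if zi > 0:
            g.append(min(max(left, 0.0) / zi, 1.0))
        else:
            g.append(1.0 if left > 0 else 0.0)
    return g


def transition_prob(g_i, q1_ij, q0_ij):
    """Mix the transmit (q1) and idle (q0) transition probabilities as in (27)."""
    return g_i * q1_ij + (1.0 - g_i) * q0_ij
```

For instance, with `z = [0.5, 0.3, 0.2]`, indices `w = [3, 2, 1]` and `alpha = 0.6`, the highest-index state is fully activated, the middle state is activated for one third of its users, and the lowest-index state stays idle, so that the activated mass adds up to `alpha`.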

We shall let $e_{ii} = 0$, and let $e_{ij}$, $i \neq j$, be a vector that has $-1$ at the $i^{th}$ element, $+1$ at the $j^{th}$ element, and $0$ at all other elements. Hence if a user changes its belief state from $\beta_i$ to $\beta_j$, the corresponding change of the system state is in the direction of $e_{ij}$ with magnitude $1/N$. Therefore, $\nabla Z^N[t]$ is a composition of expected changes in each direction $e_{ij}$. Suppose $Z^N[t] = z$. Since the expected amount of change of $Z^N[t]$ in direction $e_{ij}$ is $z_i q_{ij}(z)$, the expected drift $\nabla Z^N[t]$ can then be written as

\[ \nabla Z^N[t]\big|_{Z^N[t]=z} = \sum_{i,j} z_i\, q_{ij}(z)\, e_{ij} := Q(z) z, \tag{28} \]

where the $(i,j)^{th}$ element of the matrix $Q(z)$ is expressed as

\[ Q_{ij}(z) = \begin{cases} -\sum_{l \neq i} q_{il}(z) & \text{for } i = j, \\ q_{ji}(z) & \text{for } i \neq j. \end{cases} \tag{29} \]

Note that, although the system state $z$ can only take values on a lattice that depends on $N$, the matrix function $Q(z)$ is defined over the more general space $\mathcal{Z}$. Based on this, we proceed to define a fluid approximation model.
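
As a sanity check on (28) and (29), the following sketch builds $Q$ from a row-stochastic transition matrix $q$ and compares $Q z$ with the drift assembled direction by direction from the vectors $e_{ij}$. The helper names and the example matrix are hypothetical; NumPy is assumed to be available.

```python
import numpy as np


def drift_matrix(q):
    """Q[i, i] = -sum_{l != i} q[i, l] and Q[i, j] = q[j, i] for i != j, as in (29)."""
    q = np.asarray(q, dtype=float)
    Q = q.T.copy()
    np.fill_diagonal(Q, -(q.sum(axis=1) - np.diag(q)))
    return Q


def drift_by_directions(q, z):
    """Sum of z_i * q_ij * e_ij over all ordered pairs i != j, as in (28)."""
    q = np.asarray(q, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    drift = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                e_ij = np.zeros(n)
                e_ij[i], e_ij[j] = -1.0, 1.0
                drift += z[i] * q[i, j] * e_ij
    return drift
```

Every column of `drift_matrix(q)` sums to zero, which is the matrix form of the mass-conservation argument used for Lemma 12.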

C. Fluid Approximation Model 
We consider a fluid approximation model z[t], which is defined by the following difference equation 

z[t + 1] - z[t] = Q{z[t])z[t]. (30) 

Note that the right-hand side is completely determined by equations (26)-(29) as a function of $z[t]$ and is independent of $N$. We call $z[t]$ the 'fluid approximation model' because $z[t]$ is no longer restricted to take values on the lattice, as is the case for the original system state $Z^N[t]$, and $z[t]$ evolves in the direction of the expected change of the system state. Recalling the definition of the set $\mathcal{Z}$, we proceed with the following lemma.

Lemma 12. If $z[0] \in \mathcal{Z}$, then $z[t] \in \mathcal{Z}$ for all $t \ge 0$.



Note that by 'fluid' we mean fluid in users/channels instead of fluid with respect to time.

Proof: From (28) we have

\[ z[t+1] - z[t]\big|_{z[t]=z} = Q(z[t])\, z = \sum_{i,j} z_i[t]\, q_{ij}(z[t])\, e_{ij}. \]

Note that the belief value of a channel can only evolve within the belief states of the class of the channel. Hence, for class 1,

\[ \sum_{i=1}^{2\tau+1} z_i[t+1] - \sum_{i=1}^{2\tau+1} z_i[t] = \sum_{1 \le i,j \le 2\tau+1} z_i[t]\, q_{ij}(z[t])\, \mathbf{1}^T e_{ij} = \sum_{1 \le i,j \le 2\tau+1} z_i[t]\, q_{ij}(z[t]) \cdot (1 - 1) = 0, \]

where $\mathbf{1}$ is a vector with 1 in each element. A similar result holds for class 2. Since $z[0] \in \mathcal{Z}$, we have

\[ \sum_{i=1}^{2\tau+1} z_i[t] = \gamma_1, \qquad \sum_{i=2\tau+2}^{2(2\tau+1)} z_i[t] = \gamma_2, \qquad \forall t \ge 0. \]

Also, equations (28)-(30) indicate that $z_i[t] \ge 0$ for all $t \ge 0$ if $z[0] \in \mathcal{Z}$. Therefore $z[t] \in \mathcal{Z}$ for all $t \ge 0$, establishing the lemma. ■
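
The invariance stated in Lemma 12 can be illustrated numerically. The sketch below iterates the fluid difference equation (30) on a hypothetical single-class toy model with three belief states; the index values, success probabilities and idle transitions are made up for illustration and are not the paper's channel model. It uses the fact that $Q(z) = q(z)^T - I$ whenever $q(z)$ is row-stochastic, so that $z + Q(z)z = q(z)^T z$.

```python
import numpy as np

# Hypothetical toy model: 3 belief states of a single class.
W = np.array([1.0, 2.0, 3.0])   # index value of each state (assumed distinct)
B = np.array([0.2, 0.5, 0.8])   # probability that a transmission finds the channel ON
IDLE_NEXT = [1, 1, 1]           # next state when the channel stays idle
OFF_STATE, ON_STATE = 0, 2      # states reached after observing OFF / ON
ALPHA = 0.4                     # fraction of users activated per slot


def activation_fractions(z):
    g = np.zeros(len(z))
    for i in range(len(z)):
        left = ALPHA - z[W > W[i]].sum()   # budget left for state i
        if z[i] > 0:
            g[i] = min(max(left, 0.0) / z[i], 1.0)
        else:
            g[i] = 1.0 if left > 0 else 0.0
    return g


def transition_matrix(z):
    g = activation_fractions(z)
    n = len(z)
    q = np.zeros((n, n))
    for i in range(n):
        q[i, IDLE_NEXT[i]] += 1.0 - g[i]        # idle users
        q[i, ON_STATE] += g[i] * B[i]           # activated, observed ON
        q[i, OFF_STATE] += g[i] * (1.0 - B[i])  # activated, observed OFF
    return q


def fluid_step(z):
    # z + Q(z) z, written as q(z)^T z for a row-stochastic q(z).
    return transition_matrix(z).T @ z


def run(z0, steps):
    traj = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        traj.append(fluid_step(traj[-1]))
    return np.array(traj)
```

Calling `run([0.5, 0.3, 0.2], 50)` returns a trajectory whose rows stay nonnegative and keep summing to one, which mirrors the two properties used in the proof.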

Lemma 13. The vector $\zeta$ is the unique fixed point of the fluid approximation model, i.e., for all $z \in \mathcal{Z}$, $Q(z)z = 0$ if and only if $z = \zeta$.

Proof: The proof follows a similar line to that of [16]. Note that, under the Optimal Relaxed Policy, $\zeta = E[Z^N(\infty)]$ and an $\alpha$ fraction of channels is activated on average. Therefore, in the fluid approximation model, we have $z[t+1] - z[t]\big|_{z[t]=\zeta} = 0$, i.e., $Q(\zeta)\zeta = 0$.

Now suppose there exists another fixed point $\zeta_0 \in \mathcal{Z}$ such that $\zeta_0 \neq \zeta$ and $Q(\zeta_0)\zeta_0 = 0$. Then $\zeta_0$ corresponds to the stationary distribution of the system state under another policy $\phi(\omega_0, \rho_0)$ with threshold parameter $\omega_0$ and randomization factor $\rho_0$. Furthermore, under $\phi(\omega_0, \rho_0)$, the expected fraction of activated channels equals $\alpha$. However, this contradicts Lemma 11, which states that $(\omega^*, \rho^*)$ is the unique parameter pair that strictly satisfies the average constraint of activation. Therefore, the fixed point $\zeta$ is unique. ■

D. Convergence of the Fluid Limit Model 

Define the region $\mathcal{J}_{\omega^*} \subseteq \mathcal{Z}$ as the set of $z \in \mathcal{Z}$ such that, under the Whittle's Index Policy defined in Section IV, a channel is activated if and only if its index value is no smaller than $\omega^*$, which is the threshold for the Optimal Relaxed Policy defined in Proposition 1. This means that, at system state $z \in \mathcal{J}_{\omega^*}$, all channels with index value higher than $\omega^*$ are scheduled, the channels with index value smaller than $\omega^*$ stay idle, and the channels at index value $\omega^*$ are scheduled with a certain randomization. Specifically, $\mathcal{J}_{\omega^*} = \{ z \in \mathcal{Z} : \sum_{i:\, w_i > \omega^*} z_i < \cdots \}$

The following lemma characterizes the linearity property of the fluid approximation model in $\mathcal{J}_{\omega^*}$.

Lemma 14. (i) The vector $\zeta \in \mathcal{J}_{\omega^*}$.

(ii) The fluid difference equation (30) is linear within the region $\mathcal{J}_{\omega^*}$, i.e., there exist a matrix $Q^*$ and a vector $a^*$ such that

\[ z[t+1] - z[t] = Q^* \cdot z[t] + a^*, \quad \text{for all } z[t] \in \mathcal{J}_{\omega^*}. \tag{31} \]

Proof: (i) The vector $\zeta \in \mathcal{J}_{\omega^*}$ because, if $z[t] = \zeta$, we have $\sum_{i:\, w_i \ge \omega^*} g_i(z[t])\, z_i[t] = \cdots$

(ii) Recall that, at the beginning of the section, we have assumed $\omega^* = W_1(b^1_{0,h_1^*})$ for the belief value $b^1_{0,h_1^*}$ of a class-1 channel. The difference equation (30) becomes

\[ z[t+1] - z[t]\big|_{z[t]=z} = \sum_{i,j:\, i \neq h_1^*} z_i\, q_{ij}(z)\, e_{ij} + z_{h_1^*} \sum_j q_{h_1^* j}(z)\, e_{h_1^* j} \]
\[ = \sum_{i,j:\, i \neq h_1^*} z_i\, q_{ij}(z)\, e_{ij} + \sum_j z_{h_1^*}\, q^0_{h_1^* j}\, e_{h_1^* j} + g_{h_1^*}(z)\, z_{h_1^*} \sum_j \left[ q^1_{h_1^* j} - q^0_{h_1^* j} \right] e_{h_1^* j}, \tag{32} \]

where the second equality is from (27).

Since the total fraction of users activated is $\alpha$, we have

\[ g_{h_1^*}(z)\, z_{h_1^*} = \alpha - \sum_{i:\, w_i > \omega^*} z_i. \tag{33} \]

Substituting the expression (33) back in (32), and noting that $q_{ij}(z)$, $i \neq h_1^*$, stays constant for $z \in \mathcal{J}_{\omega^*}$ (since the threshold $\omega^*$ for activation does not change for $z \in \mathcal{J}_{\omega^*}$), the linearity property holds. ■

From Lemma 12 we know that $z[t] \in \mathcal{Z}$ for all $t \ge 0$, i.e.,

\[ \sum_{i=1}^{2\tau+1} z_i[t] = \gamma_1, \qquad \sum_{i=2\tau+2}^{2(2\tau+1)} z_i[t] = \gamma_2. \tag{34} \]

Taking note of Lemma 12, instead of using a $2(2\tau+1)$ dimensional vector $z$, it suffices to represent the system state by a $2 \cdot 2\tau$ dimensional vector $\hat{z}$, i.e.,

\[ \hat{z} = [z_1, \cdots, z_{h_1^*-1}, z_{h_1^*+1}, \cdots, z_{2\tau+h_2^*-1}, z_{2\tau+h_2^*+1}, \cdots, z_{2(2\tau+1)}], \]

in which the elements $z_{h_1^*}$ and $z_{2\tau+h_2^*}$ are eliminated from $z$. The transition of $\hat{z}[t]$, when $z[t] \in \mathcal{J}_{\omega^*}$, is obtained by substituting the relationship (34) in the difference equation (32) and eliminating the elements $z_{h_1^*}$ and $z_{2\tau+h_2^*}$, i.e.,

\[ \hat{z}[t+1] - \hat{z}[t] = U^* \hat{z}[t] + b^*, \tag{35} \]

where the matrix $U^*$ and the vector $b^*$ are obtained after the substitution. The next key lemma captures the eigenstructure of the matrix $U^*$.

Lemma 15. Each eigenvalue $\lambda$ of $U^*$ satisfies $|\lambda + 1| < 1$.

Proof: The proof is based on an explicit study of the matrix $U^*$ and is given in Appendix I. ■

This lemma leads to the local convergence of z[t]. 

Lemma 16. There exists a positive constant $\sigma$ such that, if the initial state $z[0] = x$ of the fluid approximation model is within the $\sigma$ neighborhood $\Omega_\sigma(\zeta)$ of $\zeta$, where $\Omega_\sigma(\zeta) \subseteq \mathcal{J}_{\omega^*}$, then

(i) $z[t] \in \mathcal{J}_{\omega^*}$ for all $t \ge 0$; (ii) $z[t] \to \zeta$ as $t \to \infty$.

Proof: Corresponding to $\hat{z}$, we let $\hat{\zeta}$ represent the stationary expectation of the vector $\hat{z}[t]$. Therefore, from Lemma 13,

\[ U^* \cdot \hat{\zeta} + b^* = 0. \tag{36} \]

Substituting (36) in equation (35), we have

\[ \hat{z}[t] - \hat{\zeta} = (U^* + I)^t (\hat{x} - \hat{\zeta}). \tag{37} \]

Since we have assumed that $\rho^* \neq 1$, there exists a $\sigma_0$ neighborhood $\Omega_{\sigma_0}(\zeta)$ with $\Omega_{\sigma_0}(\zeta) \subseteq \mathcal{J}_{\omega^*}$. Correspondingly, there is a neighborhood of $\hat{\zeta}$ in which the evolution of $\hat{z}[t]$ is linear and is described by (31). From Lemma 15, each eigenvalue $\lambda$ of $(U^* + I)$ satisfies $|\lambda| < 1$. According to the stability theory of linear systems [26], $\hat{z}[t]$ converges to $\hat{\zeta}$ if the initial state is close enough to $\hat{\zeta}$.

Therefore, there exists a $\sigma < \sigma_0$ neighborhood of $\zeta$ such that, if the initial state $x \in \Omega_\sigma(\zeta)$, then $z[t] \in \mathcal{J}_{\omega^*}$ and $z[t] \to \zeta$ as $t \to \infty$. ■
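
The role of the eigenvalue condition can be seen in a small numerical sketch. The matrix `U` and vector `b` below are hypothetical stand-ins (they are not the paper's $U^*$ and $b^*$); the code only illustrates that, for an affine recursion of the form (35), the condition $|\lambda + 1| < 1$ on every eigenvalue of the matrix forces convergence to the fixed point defined by (36).

```python
import numpy as np


def iterate_affine(U, b, x0, steps=200):
    """Run x[t+1] = x[t] + U x[t] + b and return (fixed point, final iterate)."""
    U = np.asarray(U, dtype=float)
    b = np.asarray(b, dtype=float)
    fixed = np.linalg.solve(U, -b)   # solves U @ fixed + b = 0
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + U @ x + b
    return fixed, x


# Hypothetical 2x2 example whose eigenvalues satisfy |lambda + 1| < 1.
U = np.array([[-0.5, 0.2],
              [0.1, -0.8]])
b = np.array([0.1, 0.3])
fixed_point, final_state = iterate_affine(U, b, x0=[1.0, -1.0])
```

Here `final_state` agrees with `fixed_point` up to numerical precision, because the error contracts as $(U + I)^t$, exactly as in (37).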
E. Convergence of the system state

The fluid approximation model provides a good estimate of the system state evolution when the number of users is large. This is captured in the following proposition, which can be viewed as a discrete-time version of Kurtz's theorem [24] applied to our problem.

Proposition 4. There exists a neighborhood $\Omega_\delta(\zeta)$ of $\zeta$ such that, if $Z^N[0] = z[0] = x \in \Omega_\delta(\zeta)$, then for any $\mu > 0$ and finite time horizon $T$ there exist positive constants $C_1$ and $C_2$ such that

\[ P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| > \mu \Big) \le C_1 \exp(-N C_2), \]

where $\delta < \sigma$, and $P_x$ denotes the probability conditioned on the initial state $Z^N[0] = x$. Furthermore, $C_1$ and $C_2$ are independent of $x$ and $N$.



Proof: Consider the random variable $Z^N[t+1]$ given $Z^N[t] = z$,

\[ Z^N[t+1] = Z^N[t] + \sum_{i,j} e_{ij}\, \frac{\sum_{h=1}^{N z_i} \eta^h_{ij}(z)}{N}, \tag{38} \]

where $\eta^h_{ij}(z)$ is an indicator function representing whether the belief value of the $h^{th}$ user in belief value $\beta_i$ transits to belief value $\beta_j$ at the next time slot. Note that, given $Z^N[t] = z$, the scheduling action for users in belief state $\beta_i$ is independent of $N$ because the scheduling decision only depends on the belief state distribution $z$. As $N$ increases and $z$ stays unchanged, more users are in belief state $\beta_i$ and the contribution of each channel to the transition of $Z^N$ scales down with $N$. From the law of large numbers, if the number of users scales up while $z_i$ is kept the same, we have

\[ \lim_{N \to \infty} \frac{\sum_{h=1}^{N z_i} \eta^h_{ij}(z)}{N} = \lim_{N \to \infty} z_i\, \frac{\sum_{h=1}^{N z_i} \eta^h_{ij}(z)}{N z_i} = z_i\, q_{ij}(z) \quad \text{almost surely.} \]
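
The law-of-large-numbers step can be illustrated with a short Monte Carlo sketch. It draws independent indicators for the $N z_i$ users in one belief state and compares the normalized count with $z_i q_{ij}$. The function name and the numerical values are hypothetical, and the independence of the indicators is an assumption of this illustration.

```python
import numpy as np


def empirical_flow(N, z_i, q_ij, rng):
    """(1/N) * sum of N*z_i independent Bernoulli(q_ij) transition indicators."""
    n_users = int(round(N * z_i))
    eta = rng.random(n_users) < q_ij   # one indicator per user in state i
    return eta.sum() / N


rng = np.random.default_rng(0)
z_i, q_ij = 0.3, 0.4
deviations = [abs(empirical_flow(N, z_i, q_ij, rng) - z_i * q_ij)
              for N in (10**2, 10**4, 10**6)]
```

The entries of `deviations` typically shrink on the order of $1/\sqrt{N}$, which is the behavior that the exponential bounds in the remainder of the proof make precise.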



Lemma 17. There exists a neighborhood $\Omega_\epsilon(\zeta)$ of $\zeta$ such that, if $Z^N[t] = z \in \Omega_\epsilon(\zeta)$, there exist $c_1$ and $c_2$ for which $Z^N[t+1]$ satisfies

\[ P\Big( \| Z^N[t+1] - (I + Q(z)) z \| > \mu \,\Big|\, Z^N[t] = z \Big) \le c_1 \exp(-N c_2), \]

where $c_1$ and $c_2$ are independent of $z$ and $N$.

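Proof: Let $\mathbf{1}_i$ be a vector with 1 at the $i^{th}$ position, so that $e_{ij} = \mathbf{1}_j - \mathbf{1}_i$. From (38),

\[ Z^N[t+1] - (I + Q(z)) z = \sum_{i=1}^{2(2\tau+1)} \sum_{j=1}^{2(2\tau+1)} \frac{\sum_{h=1}^{N z_i} \eta^h_{ij}(z)}{N}\, e_{ij} - Q(z) z = \sum_{i=1}^{2(2\tau+1)} \sum_{j=1}^{2(2\tau+1)} \frac{\sum_{h=1}^{N z_i} \big( \eta^h_{ij}(z) - q_{ij}(z) \big)}{N}\, (\mathbf{1}_j - \mathbf{1}_i). \]

Note that

\[ \sum_{i=1}^{2(2\tau+1)} \sum_{j=1}^{2(2\tau+1)} \frac{\sum_{h=1}^{N z_i} \big( \eta^h_{ij}(z) - q_{ij}(z) \big)}{N}\, \mathbf{1}_i = \sum_{i=1}^{2(2\tau+1)} \mathbf{1}_i\, \frac{\sum_{h=1}^{N z_i} \sum_{j=1}^{2(2\tau+1)} \big( \eta^h_{ij}(z) - q_{ij}(z) \big)}{N} = 0, \]

since $\sum_j \eta^h_{ij}(z) = \sum_j q_{ij}(z) = 1$. Therefore

\[ Z^N[t+1] - (I + Q(z)) z = \sum_{i=1}^{2(2\tau+1)} \sum_{j=1}^{2(2\tau+1)} \mathbf{1}_j\, \frac{\sum_{h=1}^{N z_i} \big( \eta^h_{ij}(z) - q_{ij}(z) \big)}{N}. \tag{39} \]

Note that once a user is activated, its belief value will only transit to $p_k$ or $r_k$; therefore $\eta^h_{ij}(z) \neq 0$ only for $j \in \Theta := \{1, 2\tau+1, 2\tau+2, 2(2\tau+1)\}$. Also note that for those channels that stay idle, there is no randomness associated with their belief transitions, i.e., for them $\eta^h_{ij}(z) = q_{ij}(z) \in \{0, 1\}$. Therefore the randomness is only associated with the channels which are activated, i.e., those with index value no smaller than $\omega^*$. Hence, (39) becomes

\[ Z^N[t+1] - (I + Q(z)) z = \sum_{j \in \Theta} \mathbf{1}_j \sum_{i \in \Pi_j(z)} \frac{\sum_{h=1}^{g_i(z) N z_i} \big( \eta^h_{ij}(z) - q_{ij}(z) \big)}{N}, \]

where the summation $\sum_{h=1}^{g_i(z) N z_i}(\cdot)$ is over the channels in belief state $\beta_i$ that are activated, and $\Pi_j(z)$ is the set of belief values in which channels are scheduled within the class that corresponds to belief $j \in \Theta$, i.e.,

\[ \Pi_j(z) := \begin{cases} \{ 1 \le i \le 2\tau+1 : g_i(z) > 0 \} & \text{if } j = 1,\ 2\tau+1, \\ \{ (2\tau+1)+1 \le i \le 2(2\tau+1) : g_i(z) > 0 \} & \text{if } j = 2\tau+2,\ 2(2\tau+1). \end{cases} \]

Considering each $j \in \Theta$, we have

\[ P\Big( \| Z^N[t+1] - (I + Q(z)) z \| > \mu \,\Big|\, Z^N[t] = z \Big) \le \sum_{j \in \Theta} P\Big( \Big| \sum_{i \in \Pi_j(z)} \sum_{h=1}^{g_i(z) N z_i} \frac{\eta^h_{ij}(z) - q_{ij}(z)}{N} \Big| > \frac{\mu}{4} \Big), \tag{40} \]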

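where the last inequality holds because $|\Theta| = 4$ and from the union bound. Specifically, the union bound holds since

\[ \Big\{ \Big\| \sum_{j \in \Theta} \mathbf{1}_j \sum_{i \in \Pi_j(z)} \sum_{h=1}^{g_i(z) N z_i} \frac{\eta^h_{ij}(z) - q_{ij}(z)}{N} \Big\| > \mu \Big\} \subseteq \bigcup_{j \in \Theta} \Big\{ \Big| \sum_{i \in \Pi_j(z)} \sum_{h=1}^{g_i(z) N z_i} \frac{\eta^h_{ij}(z) - q_{ij}(z)}{N} \Big| > \frac{\mu}{4} \Big\}. \]

From an extension of Chebyshev's inequality (see Exercise 1.8 in [21]) we have that, for each $j \in \Theta$, there exists a positive continuous function $f_j(\mu)$, which does not depend on $z$ and $N$, with

\[ P\Big( \Big| \sum_{i \in \Pi_j(z)} \sum_{h=1}^{g_i(z) N z_i} \frac{\eta^h_{ij}(z) - q_{ij}(z)}{N} \Big| > \frac{\mu}{4} \Big) \le \exp\Big( - f_j(\mu) \sum_{i \in \Pi_j(z)} g_i(z) N z_i \Big). \tag{41} \]

Let $a_j$ be the fraction of channels activated, under the steady state of the Optimal Relaxed Policy, in the class corresponding to belief value $\beta_j$, i.e.,

\[ a_j = \begin{cases} \sum_{i=1}^{2\tau+1} g_i(\zeta)\, \zeta_i & \text{if } j = 1,\ 2\tau+1, \\ \sum_{i=2\tau+2}^{2(2\tau+1)} g_i(\zeta)\, \zeta_i & \text{if } j = 2\tau+2,\ 2(2\tau+1). \end{cases} \tag{42} \]

For any $0 < \varepsilon < \min\{a_j, j \in \Theta\}$, there exists a neighborhood $\Omega_\epsilon(\zeta)$ such that for all $z \in \Omega_\epsilon(\zeta)$,

\[ \sum_{i \in \Pi_j(z)} g_i(z)\, z_i > a_j - \varepsilon, \quad j \in \Theta, \tag{43} \]

which essentially means that, under system state $z \in \Omega_\epsilon(\zeta)$, the fraction of activated channels in each class stays close to its value when the system state is exactly $\zeta$. Let $f(\mu) = \min\{ f_j(\mu)(a_j - \varepsilon), j \in \Theta \}$. Then from (40)-(43),

\[ P\Big( \| Z^N[t+1] - (I + Q(z)) z \| > \mu \,\Big|\, Z^N[t] = z \Big) \le \sum_{j \in \Theta} P\Big( \Big| \sum_{i \in \Pi_j(z)} \sum_{h=1}^{g_i(z) N z_i} \frac{\eta^h_{ij}(z) - q_{ij}(z)}{N} \Big| > \frac{\mu}{4} \Big) \le 4 \exp(-f(\mu) N). \]

It is clear from the proof that $f(\mu)$ does not depend on $z$ or $N$. Letting $c_1 = 4$ and $c_2 = f(\mu)$, the lemma thus holds. ■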



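Lemma 18. There exists a neighborhood $\Omega_\delta(\zeta)$ of $\zeta$ for which, if $Z^N[0] = x \in \Omega_\delta(\zeta)$, for any $t \ge 1$ there exist $c_1^t$ and $c_2^t$ with

\[ P_x\big( \| Z^N[t] - z[t] \| > \mu \big) \le c_1^t \exp(-N c_2^t), \]

where $c_1^t$ and $c_2^t$ are independent of $x$ and $N$.

Proof: Recall that $\sigma$ and $\epsilon$ are defined in Lemma 16 and Lemma 17, respectively. We let $\delta < \min\{\sigma, \epsilon\}$ be such that, if $z[0] \in \Omega_\delta(\zeta)$, then $z[t] \in \Omega_{\epsilon - \rho}(\zeta)$ for all $t \ge 1$, where $\rho$ is a constant with $0 < \rho < \epsilon$ that satisfies

\[ \| (Q(x) + I) x - (Q(y) + I) y \| < \nu, \quad \text{for all } x, y \in \mathcal{Z} \text{ with } \| x - y \| < \rho, \tag{44} \]

for a positive constant $\nu < \mu$. We proceed to prove the statement of the lemma by induction.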
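For $t = 1$, if $x \in \Omega_\delta(\zeta)$, from Lemma 17 there exist $c_1^1 > 0$ and $c_2^1 > 0$ with

\[ P_x\big( \| Z^N[1] - z[1] \| > \mu \big) = P_x\big( \| Z^N[1] - (I + Q(x)) x \| > \mu \big) \le c_1^1 \exp(-c_2^1 N). \]

Suppose the statement is true at $t \ge 1$. Then there exist $c_1^t$ and $c_2^t$ for which

\[ P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \big) \]
\[ = P_x\big( \| Z^N[t] - z[t] \| > \rho \big)\, P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| > \rho \big) \]
\[ \quad + P_x\big( \| Z^N[t] - z[t] \| \le \rho \big)\, P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \]
\[ \le c_1^t \exp(-c_2^t N) + P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big). \tag{45} \]

Now consider the second term in (45).

\[ P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \]
\[ = P_x\big( \| Z^N[t+1] - (I + Q(Z^N[t])) Z^N[t] + (I + Q(Z^N[t])) Z^N[t] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \]
\[ \le P_x\big( \| Z^N[t+1] - (I + Q(Z^N[t])) Z^N[t] \| + \| (I + Q(Z^N[t])) Z^N[t] - (I + Q(z[t])) z[t] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \]
\[ \le P_x\big( \| Z^N[t+1] - (I + Q(Z^N[t])) Z^N[t] \| > \mu - \nu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \]
\[ = \sum_{z \in \Omega_\rho(z[t])} P_x\big( Z^N[t] = z \,\big|\, Z^N[t] \in \Omega_\rho(z[t]) \big)\, P\big( \| Z^N[t+1] - (I + Q(z)) z \| > \mu - \nu \,\big|\, Z^N[t] = z \big), \tag{46} \]

where the first inequality follows from the triangle inequality, and the second inequality is from relationship (44).

Since $\Omega_\rho(z[t]) \subseteq \Omega_\epsilon(\zeta)$, from Lemma 17, for $z \in \Omega_\rho(z[t])$ there exist positive constants $c_1$ and $c_2$, which do not depend on $z$ or $N$, with

\[ P\big( \| Z^N[t+1] - (I + Q(z)) z \| > \mu - \nu \,\big|\, Z^N[t] = z \big) \le c_1 \exp(-c_2 N). \tag{47} \]

Substituting (47) in (46), we have

\[ P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \,\big|\, \| Z^N[t] - z[t] \| \le \rho \big) \le c_1 \exp(-c_2 N). \tag{48} \]

Hence from (45) and (48), there exist constants $c_1^{t+1} > 0$ and $c_2^{t+1} > 0$ that do not depend on $x$ and $N$ with

\[ P_x\big( \| Z^N[t+1] - z[t+1] \| > \mu \big) \le c_1^{t+1} \exp(-N c_2^{t+1}). \]

By induction, the lemma holds. ■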
Note that, from the union bound,

\[ P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| > \mu \Big) \le \sum_{t=0}^{T-1} P_x\big( \| Z^N[t] - z[t] \| > \mu \big). \tag{49} \]

Therefore, from Lemma 18, over the finite time horizon $T$ there exist positive constants $C_1$ and $C_2$, which do not depend on $x$ and $N$, such that

\[ P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| > \mu \Big) \le C_1 \exp(-N C_2), \]

which concludes the proof of Proposition 4. ■
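
One concrete way to pick the constants in the last step is $C_1 = \sum_t c_1^t$ and $C_2 = \min_t c_2^t$; this particular choice is an illustration and is not spelled out in the text. The sketch below checks numerically that the sum of per-slot bounds in (49) is then dominated by the single exponential bound. The per-slot constants are made-up example values.

```python
import math


def union_bound(c1s, c2s, N):
    """Right-hand side of (49) with the per-slot bounds c1 * exp(-N * c2)."""
    return sum(c1 * math.exp(-N * c2) for c1, c2 in zip(c1s, c2s))


def combine_bounds(c1s, c2s):
    """One valid (not necessarily tight) choice of C1 and C2."""
    return sum(c1s), min(c2s)


# Hypothetical per-slot constants for a horizon of T = 3 slots.
c1s, c2s = [1.0, 2.0, 4.0], [0.5, 0.2, 0.3]
C1, C2 = combine_bounds(c1s, c2s)
```

Because each exponent $c_2^t$ is at least $C_2$, every term $c_1^t e^{-N c_2^t}$ is at most $c_1^t e^{-N C_2}$, and summing over $t$ gives the claimed bound for every $N$.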

According to Proposition 4, which we have just established, the system state $Z^N[t]$ behaves very close to the fluid approximation model $z[t]$ when the number of users $N$ is large. Since we have shown the convergence of $z[t]$ to $\zeta$ within $\Omega_\sigma(\zeta)$ in Lemma 16, we are ready to establish the local convergence of the system state to $\zeta$.

Lemma 19. If $Z^N[0] = x \in \Omega_\delta(\zeta)$, then for any $\mu > 0$ there exists a time $T_0$ such that for each $T > T_0$ there exist positive constants $s_1$ and $s_2$ with

\[ P_x\Big( \sup_{T_0 \le t < T} \| Z^N[t] - \zeta \| > \mu \Big) \le s_1 \exp(-N s_2). \]

Proof: We let $0 < \nu < \mu$. Noting that $\delta < \sigma$, from Lemma 16 we have that, given $z[0] = x \in \Omega_\delta(\zeta)$, there exists $T_0$ such that for all $t \ge T_0$,

\[ \| z[t] - \zeta \| < \nu. \]

From Proposition 4 we know that there exist positive constants $s_1$ and $s_2$ such that

\[ P_x\Big( \sup_{T_0 \le t < T} \| Z^N[t] - \zeta \| > \mu \Big) \le P_x\Big( \sup_{T_0 \le t < T} \| Z^N[t] - z[t] \| + \| z[t] - \zeta \| > \mu \Big) \]
\[ \le P_x\Big( \sup_{T_0 \le t < T} \| Z^N[t] - z[t] \| > \mu - \nu \Big) \]
\[ \le P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| > \mu - \nu \Big) \]
\[ \le s_1 \exp(-N s_2). \]

Hence the lemma holds. ■

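The previous lemma allows us to establish the local convergence result. Let $v : \mathcal{Z} \to \mathbb{R}$ be a mapping such that $v(z)$ represents the per-user average throughput under system state $z$. Therefore, $N v(Z^N[t])$ is the immediate reward at time $t$ and we also have $r(\gamma, \alpha) = v(\zeta)$.

For $\varepsilon > 0$, we let $\mu > 0$ be such that for any $x \in \mathcal{Z}$, if $\| x - \zeta \| < \mu$, then

\[ | v(x) - v(\zeta) | < \varepsilon. \tag{50} \]

Note that the per-user instantaneous throughput satisfies $v(z) \le 1$. Therefore,

\[ \Big| \frac{R_T^{N_m}(\gamma, \alpha, x)}{N_m} - r(\gamma, \alpha) \Big| = \Big| \frac{1}{N_m T} \sum_{t=0}^{T-1} E\big[ N_m v(Z^{N_m}[t]) \big] - v(\zeta) \Big| \]
\[ = \Big| \frac{1}{T} \sum_{t=0}^{T_0-1} E\big[ v(Z^{N_m}[t]) - v(\zeta) \big] + \frac{1}{T} \sum_{t=T_0}^{T-1} E\big[ v(Z^{N_m}[t]) - v(\zeta) \big] \Big| \]
\[ \le \frac{1}{T} \sum_{t=0}^{T_0-1} E\big[ | v(Z^{N_m}[t]) - v(\zeta) | \big] + \frac{1}{T} \sum_{t=T_0}^{T-1} E\big[ | v(Z^{N_m}[t]) - v(\zeta) | \big]. \tag{51} \]

Letting $A_{N_m}$ be the event $\{ \sup_{T_0 \le t < T} \| Z^{N_m}[t] - \zeta \| > \mu \}$, we proceed to bound the second term in (51).

\[ \frac{1}{T} \sum_{t=T_0}^{T-1} E\big[ | v(Z^{N_m}[t]) - v(\zeta) | \big] = P_x(A_{N_m})\, \frac{1}{T} \sum_{t=T_0}^{T-1} E\big[ | v(Z^{N_m}[t]) - v(\zeta) | \,\big|\, A_{N_m} \big] + \big( 1 - P_x(A_{N_m}) \big)\, \frac{1}{T} \sum_{t=T_0}^{T-1} E\big[ | v(Z^{N_m}[t]) - v(\zeta) | \,\big|\, A_{N_m}^c \big] \]
\[ \le P_x(A_{N_m}) + \big( 1 - P_x(A_{N_m}) \big) \varepsilon = P_x(A_{N_m}) (1 - \varepsilon) + \varepsilon, \]

where the inequality is from the fact that $v(z) \le 1$ and the relation (50).

According to Lemma 19, when $x \in \Omega_\delta(\zeta)$ we have $\lim_{m \to \infty} P_x(A_{N_m}) = 0$. Since the first term in (51) is at most $T_0 / T$, we obtain

\[ \lim_{m \to \infty} \Big| \frac{R_T^{N_m}(\gamma, \alpha, x)}{N_m} - r(\gamma, \alpha) \Big| \le \frac{T_0}{T} + \varepsilon. \]

Since $\varepsilon$ can be arbitrarily small, we have

\[ \lim_{m \to \infty} \Big| \frac{R_T^{N_m}(\gamma, \alpha, x)}{N_m} - r(\gamma, \alpha) \Big| \le \frac{T_0}{T}. \]

Hence, taking the limit in $T$ on both sides,

\[ \lim_{T \to \infty} \lim_{m \to \infty} \frac{R_T^{N_m}(\gamma, \alpha, x)}{N_m} = r(\gamma, \alpha). \]

We have thus proved Proposition 2.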



Appendix F 
Proof of Lemma |4] 

(i) Here we prove that the Markov chain has a unique recurrent class by showing that, starting from any state, a particular state can be reached with positive probability.

Case (1). Suppose $\alpha \le \gamma_1$. Starting from any initial state $Z^N[0]$, the following transition can occur: whenever the channels in class 1 are activated, their states are observed to be ON, and whenever channels in class 2 are activated, they are revealed to be OFF. Then, after a long enough time duration $t_1$, an $\alpha$ fraction of channels, all in class 1, will be in belief value $p_1$, and the other channels will have their stationary belief values. Hence the system state will be $Z^N[t_1] = [Z^{1,N}[t_1], Z^{2,N}[t_1]]$ with $Z^{1,N}_{1,1}[t_1] = \alpha$, $Z^{1,N}_{s}[t_1] = \gamma_1 - \alpha$, $Z^{2,N}_{s}[t_1] = \gamma_2$, and with 0 in all other positions.

Case (2). Suppose $\alpha > \gamma_1$. Starting from any initial state $Z^N[0]$, consider the following transition path. Within the first period of time slots, $0 \le t < t_0$, whenever users in class 1 are activated, they turn out to be in state 1, and whenever users in class 2 are activated, they turn out to be in state 0. Then, if $t_0$ is long enough, $Z^{1,N}[t_0]$ is such that $Z^{1,N}_{1,1}[t_0] = \gamma_1$, with zero in all other elements. In the second period, $t_0 \le t < t_1$, whenever users in class 1 are activated, they remain in state 1, and whenever users in class 2 are activated, they turn out to be in state 1 as well. Then, after a long enough time until $t_1$, $Z^N[t_1] = [Z^{1,N}[t_1], Z^{2,N}[t_1]]$ with $Z^{1,N}_{1,1}[t_1] = \gamma_1$, $Z^{2,N}_{1,1}[t_1] = \alpha - \gamma_1$, and $Z^{2,N}_{s}[t_1] = 1 - \alpha$, with zero in all other elements.

Since the state space of the Markov chain $Z^N[t]$ is finite, there is at least one recurrent class. As we have seen in the above cases, starting from any state, $Z^N[t]$ can reach a particular state. Therefore there can only be one recurrent class. We shall henceforth denote this particular state as $Z^N_p$. It is also clear from the proof that the Markov chain is aperiodic because of the possible self-transition in state $Z^N_p$.

(ii) Similar to the proof of Proposition 2, in this part we drop the suffixes $\alpha$ and $\gamma$ in the notation $\zeta^{\alpha}_{\gamma}$, and we assume $W_2(b^2_{0,h_2^*-1}) < W_1(b^1_{0,h_1^*}) = \omega^* < W_2(b^2_{0,h_2^*})$. Recall that from the expression (20) of Whittle's index

value that VF/t(7r) = Wkib^) for vr G Bk, it > b^, k = 1,2. We first characterize the structure of C. From the 
description in Lemma [T] we know that the non-zero elements of C are 

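\[ \zeta^1_0 := \zeta^1_{0,1} = \zeta^1_{0,2} = \cdots = \zeta^1_{0,h_1^*}, \qquad \zeta^1_{0,h_1^*+1} = (1 - \rho^*)\, \zeta^1_{0,h_1^*}, \qquad \zeta^1_{1,1} = 1 - \sum_{h=1}^{h_1^*+1} \zeta^1_{0,h}, \]

\[ \zeta^2_0 := \zeta^2_{0,1} = \zeta^2_{0,2} = \cdots = \zeta^2_{0,h_2^*-1} = \zeta^2_{0,h_2^*}, \qquad \zeta^2_{1,1} = 1 - \sum_{h=1}^{h_2^*} \zeta^2_{0,h}. \]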

We shall proceed to construct a path from the state $Z^N_p$ to an arbitrary neighborhood of $\zeta$. For ease of exposition, in the proof we no longer consider the channels as unsplittable entities. Instead, the transition in each stage deals with the belief state evolution of a certain fraction of users. As we shall see, under this assumption we can construct a transition path of $Z^N[t]$ under the Whittle's Index Policy that transits from $Z^N_p$ to the exact value $\zeta$. Although the identified path may not be feasible in reality for small values of $N$, as the number of users $N$ increases we can find a transition path, which treats each user as an unsplittable entity, that is arbitrarily close to the identified path and thus ultimately enters any given neighborhood of $\zeta$.

Note that when $Z^N[t_1] = Z^N_p$, $Z^N[t_1] = [Z^{1,N}[t_1], Z^{2,N}[t_1]]$, where

\[ Z^{1,N}_{1,1}[t_1] + Z^{1,N}_{s}[t_1] = \gamma_1, \quad \text{and} \quad Z^{2,N}_{1,1}[t_1] + Z^{2,N}_{s}[t_1] = \gamma_2. \]

In the following construction we shall assume that belief values are updated at the end of each slot, when the actual channel states are revealed.

Case (1). Suppose h*^ > and Wk{bl) > Wk{b'i). We shall denote h[ = max{/ : W\bl i) < W^{b'i)}. In this 
case. The path is constructed with the stages below, starting from state = Zp . 

Stage 1.1. In the first slot, among the a fraction activated channels, a — Cohi+i amount remains in ON state, 
and Co h'+i amount turn out in OFF state and are in class 1. Hence the end of this slot, Z^ = [Z^'^ , Z^'^] has 
the following non-zero elements 

^0,1 - ^0,h'i+l' ^1,1 - 71 - Co./iJ+l' ^1,1 + -72- 

Stage 1.2. In each of the next slots, a — Co amount in the activated channels turn out in ON state, and Cq 
amount of them turn out to be in OFF state and are in class 1. So at the end of the last slot of this stage, the 
non-zero elements of the system state Z^ = [Z^'^ , Z^'^] satisfies 

ryl,N _ryl,N _ _ ryl.N _ ^1 rTl,N _ a1 ,71,7V _ ^1 ry2,N rv2,N _ 

^0,1 —^0,2 — ■ ■ ■ — ^0,hj — ^0' ^0,/it+l ~ 'i'O.h.t + l' -^1,1 — "S.!,!' ^1,1 + ^« ~ 72- 

Stage 2. In the next few slots, all activated channels turn out to be in state 1. This stage goes on for h'^ — h\ 
slots, until those channels that reach belief state 6q ^ at the end of stage 1.1 are in belief state 6q ^j/^^^- Then by 
the end of the last slot of this stage, the non-zero elements of the system state Z^ satisfies 

ryl,N _ _ yl,N _^1 yl,N _ ^1 r^l,N _ ^1 rj2,N . rj2,N _ 

^0,/ii-/iI+l - • • • - ^0,/ii -''O' ^0,/xi + l - ''0,ftt+l> ^1,1 -".l.l) ^1,1 -72- 
Stage 3. In each of the following slots, among all channel activated, only those in belief state 6q ^, _|_^ turn out
to be in OFF state. This stage goes on until those channels that transit to belief state in stage 2 reaches 
belief state 6q /j.„/j.+i- Hence by the end of the final slot of this stage, 

yl,N _ _ yl,N _ >1 yl,N _ >1 yl,N _ _ yl,N _ >1 y2,N y2,N _ 

^0,1 — ■ ■ ■ — ^0,fel-/i5 ~ ''O' ^0,/iI-/i5+l ~ ''O./iJ+l' ^0,hi-hJ+2 — ■ ■ ■ — -^O./i'i+l ~ ^0' ^1,1 + —72- 

Stage 4. In each of the next slots, among all users activated, those in belief state 6^ ^, turn out to be in 
OFF state, and amount of activated channels in class 2 turn out in OFF state. Then by the end of the final 
slot in this stage, the system state will be = i.e., 

yl,N _yl,N _ _ yl,N _ ^1 yl,N _ >1 yi,N _ /-I 

^0,1 —^0,2 — " ' — ^0,hl — ^O-- ~ ''O./iJ+l' ^1,1 —''1,1 

y2,N _y2,N _ _ y2,N _ y2,N _ ^2 y2,N _ ^2 
^0,1 —^0,2 — " ' — '^0,h*~^ ~ 0,hi ~ ^0-: ^1,1 — 

Case (2). Suppose Wkib]) > Wk{b1) and h\ < h*. We shall let h[ = max{/ : W^{bli) < W^ib"^,)} and 
d = [/i2/(^'i + 1)J- Starting from state = Zp , the path is constructed with the stages below, where stage 

1.1 and 1.2 are the same with the previous case. 

Stage 1.1. In the first slot, among the a fraction of activated channels, only Coh*+i amount turn out in OFF state 
and they are in class 1. Therefore at the end of this slot, Z^ = [Z^'^, Z^'^] with non-zero elements being 

yl,N f.\ yl,N yl^N a1 y2,N . y2,N 

^0,1 -Co,hI+l' ^1,1 - 71 - Co./iI+H ^1,1 -72- 

Stage 1.2. In each of the next slots, a — amount of activated channels are in state '1', and (^q amount 
are in OFF state and are in class 1. Hence at the end of the last slot of this stage, the non-zero elements of 
ZN ^ [z^,N^z^^^] satisfies 

yl,N _yl,N _ _ yl,N _ >1 yl,N _ ^\ yl,N _ /-I y2,N y2,N _ 

^0,1 —^0,2 — ' " — ^0,h'i ~ ~ 'i'O.ht+l' ^1,1 —''1,1' ^1,1 + —72- 

Letting t2 be the slot right after stage 1.2, the path proceeds as follows. 
Stage 2. 

(1) From slot t2 to slot ^2 + ^ — — 1> all activated channels in class 1 turn out to be in state 1. Hence at the 
end of slot t2 + h'^ — hi — 1, the channels that reach belief state 6q ^.^^^ at the end of stage 1.2 are in belief state 
&o h[+i- Next, from slot t2 + h[ — to slot t2 + {d+ l){h[ + 1) — 1, among the activated channels in class 1, only 
those in belief state ^, turn out to be in OFF state. Therefore, at the end of slot t2 + {d + + 1) — 1, 
the system state vector Z^'^ that correspond to class- 1 channels is 

yl,N _yl,N _ _ yl,N _ ^1 yl,N _ >1 y^,N _ /-I 

^0,1 -^0,2 - ' " - ^0,hl — ^0^ ^0,hl+l — ^0,hl+l^ ^1,1 — 

(2) In the meanwhile, from slot t2 + {d+l){h[ + 1) - /ig - 1 to slot t2 + {d+l){h[ + 1) - 1, among the activated 
channels in class 2, amount turn out to be in OFF state. Hence by the end of slot ^2 + + + 1) — 1> 
the vector Z^'^ that correspond to class-2 channels is 

y2,N _ y2,N _ _ y2,N _ y2,N _ ^2 y2,N _ >2 

^0,1 —^0,2 — ■ ■ ■ — ^0,h*-l — ^0,h* — SO' ^1,1 — ''1,1- 

Therefore, at the end of slot t2 + {d + l){h[ + 1) - 1, Z^ = C 

Appendix G 
Proof of Lemma [5] 

The proof is a discrete-time version of the proof of Theorem 6.89 from [21]. We first present a lemma which is an extension of Lemma 19.

Lemma 20. There is a neighborhood $\Omega_\vartheta(\zeta^{\alpha}_{\gamma})$ of $\zeta^{\alpha}_{\gamma}$ with $\vartheta < \delta$, for which if $Z^N[0] = x \in \Omega_\vartheta(\zeta^{\alpha}_{\gamma})$, then for any $\mu > 0$ and time $T$, there exist positive constants $p_1$ and $p_2$ with

\[ P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \mu \Big) \le p_1 \exp(-N p_2), \]

where $p_1$ and $p_2$ are independent of $x$ and $N$.

Proof: Note that we have established, in Lemma 16, the local convergence of the fluid approximation model $z[t]$ in a neighborhood $\Omega_\sigma(\zeta^{\alpha}_{\gamma})$. We let $0 < \nu < \mu$ and let $\vartheta < \delta$ (recall that $\delta$ is defined in Proposition 4 with $\delta < \sigma$) be such that if $z[0] \in \Omega_\vartheta(\zeta^{\alpha}_{\gamma})$, then

\[ \| z[t] - \zeta^{\alpha}_{\gamma} \| < \nu, \quad \forall t \ge 0. \]

From Proposition 4, there exist positive constants $p_1$ and $p_2$ with

\[ P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \mu \Big) \le P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| + \| z[t] - \zeta^{\alpha}_{\gamma} \| > \mu \Big) \]
\[ \le P_x\Big( \sup_{0 \le t < T} \| Z^N[t] - z[t] \| > \mu - \nu \Big) \]
\[ \le p_1 \exp(-N p_2), \]

which proves the lemma. ■
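We let $\epsilon' < \vartheta$ be such that if $z[0] \in \Omega_{\epsilon'}(\zeta^{\alpha}_{\gamma})$, then $z[t] \in \Omega_{\epsilon}(\zeta^{\alpha}_{\gamma})$ for $t \ge 0$.

We let $\varrho^N_{2n}$, $n = 0, 1, \cdots$, be the time slots of consecutive hitting times into the neighborhood $\Omega_{\epsilon'}(\zeta^{\alpha}_{\gamma})$ from outside of the neighborhood when the total number of users is $N$. Similarly, we let $\varrho^N_{2n+1}$, $n = 0, 1, \cdots$, denote the time slots of exiting the neighborhood $\Omega_{\epsilon}(\zeta^{\alpha}_{\gamma})$ from inside of the neighborhood, when the total number of users is $N$. Hence $Y_n = Z^N[\varrho^N_{2n}]$, $n = 0, 1, \cdots$, evolves as a Markov chain. In steady state,

\[ E[\varrho^N_{2n+2} - \varrho^N_{2n}] = E[\varrho^N_{2n+2} - \varrho^N_{2n+1}] + E[\varrho^N_{2n+1} - \varrho^N_{2n}]. \tag{52} \]

We let $T_\epsilon(N)$ denote the random variable $\varrho^N_{2n+1} - \varrho^N_{2n}$. For any constant $K > 0$, we have

\[ E[T_\epsilon(N)] = \sum_{t=1}^{\infty} t \cdot P(T_\epsilon(N) = t) \ge 2K \cdot P(T_\epsilon(N) \ge 2K) = 2K \cdot P_{Z^N[\varrho^N_{2n}]}\Big( \sup_{\varrho^N_{2n} \le t < \varrho^N_{2n} + 2K} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| \le \epsilon \Big). \tag{53} \]

Note that

\[ P_{Z^N[\varrho^N_{2n}]}\Big( \sup_{\varrho^N_{2n} \le t < \varrho^N_{2n} + 2K} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \epsilon \Big) = \sum_{z \in \Omega_{\epsilon'}(\zeta^{\alpha}_{\gamma})} P\big( Z^N(\varrho^N_{2n}) = z \big)\, P_z\Big( \sup_{0 \le t < 2K} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \epsilon \Big). \tag{54} \]

Since $\epsilon' < \vartheta$, from Lemma 20 there exist positive constants $\varsigma_1$ and $\varsigma_2$ such that for any $z \in \Omega_{\epsilon'}(\zeta^{\alpha}_{\gamma})$,

\[ P_z\Big( \sup_{0 \le t < 2K} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \epsilon \Big) \le \varsigma_1 \exp(-\varsigma_2 N). \tag{55} \]

Substituting (55) in (54), we have

\[ P_{Z^N[\varrho^N_{2n}]}\Big( \sup_{\varrho^N_{2n} \le t < \varrho^N_{2n} + 2K} \| Z^N[t] - \zeta^{\alpha}_{\gamma} \| > \epsilon \Big) \le \varsigma_1 \exp(-\varsigma_2 N). \]

Therefore, $P_{Z^{N_m}[\varrho^{N_m}_{2n}]}\big( \sup_{\varrho^{N_m}_{2n} \le t < \varrho^{N_m}_{2n} + 2K} \| Z^{N_m}[t] - \zeta^{\alpha}_{\gamma} \| \le \epsilon \big) \to 1$ as $m \to \infty$. From (53), if $m$ is large enough, we have

\[ E[\varrho^{N_m}_{1} - \varrho^{N_m}_{0}] \ge K. \]

Since $K$ can be arbitrarily large, $\lim_{m \to \infty} E[\varrho^{N_m}_{1} - \varrho^{N_m}_{0}] = \infty$. Since from the assumption we know that $E[\varrho^{N_m}_{2} - \varrho^{N_m}_{1}] \le M_\epsilon$, from equation (52) we thus have

\[ \lim_{m \to \infty} P\big( Z^{N_m}[\infty] \notin \Omega_\epsilon(\zeta^{\alpha}_{\gamma}) \big) = 0, \]

which concludes the proof.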



Appendix H 
Proof of Proposition [3] 

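For any $\varepsilon > 0$, let $\epsilon > 0$ be such that for $x \in \mathcal{Z}$, if $\| x - \zeta^{\alpha}_{\gamma} \| < \epsilon$, then

\[ | v(x) - r(\gamma, \alpha) | < \varepsilon. \]

Consider a fixed $N_m$ and denote the event $E_{N_m} = \{ Z^{N_m}[\infty] \in \Omega_\epsilon(\zeta^{\alpha}_{\gamma}) \}$. Then

\[ \Big| \frac{R^{N_m}(\gamma, \alpha)}{N_m} - r(\gamma, \alpha) \Big| \le E\big[ | v(Z^{N_m}[\infty]) - v(\zeta^{\alpha}_{\gamma}) | \big] \]
\[ = P(E_{N_m})\, E\big[ | v(Z^{N_m}[\infty]) - v(\zeta^{\alpha}_{\gamma}) | \,\big|\, E_{N_m} \big] + P(E^c_{N_m})\, E\big[ | v(Z^{N_m}[\infty]) - v(\zeta^{\alpha}_{\gamma}) | \,\big|\, E^c_{N_m} \big]. \tag{56} \]

Applying Lemma 5 to (56), we have

\[ \lim_{m \to \infty} \Big| \frac{R^{N_m}(\gamma, \alpha)}{N_m} - r(\gamma, \alpha) \Big| \le \lim_{m \to \infty} \big( \varepsilon + P(E^c_{N_m}) \big) = \varepsilon. \]

Since $\varepsilon$ can be arbitrary,

\[ \lim_{m \to \infty} \frac{R^{N_m}(\gamma, \alpha)}{N_m} = r(\gamma, \alpha), \]

which proves the proposition.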



Appendix I 
Proof of Lemma [15] 

After some calculation, the matrix U* takes the form 

\Q^{z) B 



where matrix B is expressed as 



B 







U* 



Q\z)\ 



Kh,-^ 



0,h'i 



1 



-bl,,: 



-bo,hi 



-bl,H: 



in which only the first, last and + 1^^ row have non-zero elements, and for each row, non-zero terms start at 
the element. 
The matrices $Q^1(z)$ and $Q^2(z)$ are expressed as,



-1 

1 -1 



b, 



!,1 



1,1 



hi.. 



Q\z) 



-1 

1 -1 



^-hlni 



1 ^0,/i|+l 



?)2 



1 -P2 



We need the following lemma to proceed. 
Lemma 21. For any I G Z+, 

Proof: The proof is moved to Appendix |Jl ■ 

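With this lemma, we proceed to characterize the eigenvalues of the matrix $U^*$, which are given by the solutions to the equation $\det(U^* - \lambda I) = 0$, where

\[ \det(U^* - \lambda I) = \det \begin{bmatrix} Q^1(z) - \lambda I & B \\ 0 & Q^2(z) - \lambda I \end{bmatrix} = \det \begin{bmatrix} Q^1(z) - \lambda I & 0 \\ 0 & Q^2(z) - \lambda I \end{bmatrix}, \]

where the second equality is from the property of block matrices. Therefore, we have

\[ \det(U^* - \lambda I) = \det(Q^1(z) - \lambda I)\, \det(Q^2(z) - \lambda I). \]

(1) We first study the characteristic polynomial $\det(Q^1(z) - \lambda I)$. After some algebra we have

\[ \det(Q^1(z) - \lambda I) = (1 + \lambda)^{2\tau - h_1^*} \Big[ \big[ \lambda + (1 - p_1) + b^1_{0,h_1^*} \big] (1 + \lambda)^{h_1^* - 1} - \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \big[ 1 + (1 + \lambda) + (1 + \lambda)^2 + \cdots + (1 + \lambda)^{h_1^* - 2} \big] \Big] = (1 + \lambda)^{2\tau - h_1^*} \chi_1(\lambda), \]

where

\[ \chi_1(\lambda) = \big[ \lambda + (1 - p_1) + b^1_{0,h_1^*} \big] (1 + \lambda)^{h_1^* - 1} - \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \big[ 1 + (1 + \lambda) + (1 + \lambda)^2 + \cdots + (1 + \lambda)^{h_1^* - 2} \big]. \]

Consider the equation $\chi_1(\lambda) = 0$:

\[ \big[ \lambda + (1 - p_1) + b^1_{0,h_1^*} \big] (1 + \lambda)^{h_1^* - 1} = \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \big[ 1 + (1 + \lambda) + (1 + \lambda)^2 + \cdots + (1 + \lambda)^{h_1^* - 2} \big]. \tag{57} \]

Clearly, the matrix $Q^1(z)$ has eigenvalue $-1$ of multiplicity $2\tau - h_1^*$. Let $\lambda$ be any other eigenvalue of $Q^1(z)$; we proceed to show that $|\lambda + 1| < 1$.

We prove this by contradiction. Suppose $\lambda$ is such that $|\lambda + 1| \ge 1$. Then, taking the modulus of the left-hand side of equation (57), we have

\[ \big| \big[ \lambda + (1 - p_1) + b^1_{0,h_1^*} \big] (1 + \lambda)^{h_1^* - 1} \big| = \big| \lambda + 1 - p_1 + b^1_{0,h_1^*} \big| \cdot |1 + \lambda|^{h_1^* - 1} \]
\[ \ge \big( |\lambda + 1| - p_1 + b^1_{0,h_1^*} \big) |1 + \lambda|^{h_1^* - 1} \]
\[ \ge \big( 1 - p_1 + b^1_{0,h_1^*} \big) |1 + \lambda|^{h_1^* - 1}, \]

where the first inequality is from the triangle inequality. Applying Lemma 21 we have

\[ \big( 1 - p_1 + b^1_{0,h_1^*} \big) |1 + \lambda|^{h_1^* - 1} > (h_1^* - 1) \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \cdot |1 + \lambda|^{h_1^* - 1} \]
\[ \ge \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \big[ 1 + |1 + \lambda| + \cdots + |1 + \lambda|^{h_1^* - 2} \big] \]
\[ \ge \big( b^1_{0,h_1^*+1} - b^1_{0,h_1^*} \big) \big| 1 + (1 + \lambda) + \cdots + (1 + \lambda)^{h_1^* - 2} \big|, \tag{58} \]

where the first inequality is from Lemma 21, the second inequality is from the fact that $|\lambda + 1| \ge 1$, and the last inequality comes from the triangle inequality. Note that inequality (58) contradicts (57). Therefore each eigenvalue of the matrix $Q^1(z)$ must satisfy $|\lambda + 1| < 1$.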

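(2) We then study the characteristic polynomial $\det(Q^2(z) - \lambda I)$. We derive that

$$\begin{aligned}
\det\big(Q^2(z) - \lambda I\big) &= (1+\lambda)^{2\tau - h_2^*}\Big\{\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\big[1 + (1+\lambda) + \cdots + (1+\lambda)^{h_2^*-3}\big] \\
&\qquad + (1+\lambda)^{h_2^*-2}\Big[\big[(1-p_2) + \lambda\big](2+\lambda) + b^2_{0,h_2^*}\Big]\Big\} \\
&= (1+\lambda)^{2\tau - h_2^*}\,\chi_2(\lambda),
\end{aligned} \tag{59}$$

where

$$\chi_2(\lambda) = \big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\big[1 + (1+\lambda) + \cdots + (1+\lambda)^{h_2^*-3}\big] + (1+\lambda)^{h_2^*-2}\Big[\big[(1-p_2) + \lambda\big](2+\lambda) + b^2_{0,h_2^*}\Big],$$

and consider

$$\begin{aligned}
\lambda\,\chi_2(\lambda) &= \big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\,\lambda\big[1 + (1+\lambda) + \cdots + (1+\lambda)^{h_2^*-3}\big] + (1+\lambda)^{h_2^*-2}\,\lambda\Big[\big[(1-p_2)+\lambda\big](2+\lambda) + b^2_{0,h_2^*}\Big] \\
&= \big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\,(1+\lambda-1)\big[1 + (1+\lambda) + \cdots + (1+\lambda)^{h_2^*-3}\big] + (1+\lambda)^{h_2^*-2}\,\lambda\Big[\big[(1-p_2)+\lambda\big](2+\lambda) + b^2_{0,h_2^*}\Big] \\
&= \big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\big[(1+\lambda)^{h_2^*-2} - 1\big] + (1+\lambda)^{h_2^*-2}\,\lambda\Big[\big[(1-p_2)+\lambda\big](2+\lambda) + b^2_{0,h_2^*}\Big] \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*-2}\Big\{\lambda\big[(1-p_2)+\lambda\big](2+\lambda) + b^2_{0,h_2^*}\lambda + \big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big]\Big\} \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*-2}\Big\{\lambda\Big[\big[(1-p_2)+\lambda\big](2+\lambda) + 1\Big] + (1-p_2)\Big\} \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*-2}\Big\{\lambda\Big[(1-p_2)(2+\lambda) + (\lambda+1)^2\Big] + (1-p_2)\Big\} \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*-2}\Big\{(1-p_2)(1+\lambda)^2 + \lambda(\lambda+1)^2\Big\} \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*-2}\,(1-p_2+\lambda)(\lambda+1)^2 \\
&= -\big[(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big] + (1+\lambda)^{h_2^*}\,(1-p_2+\lambda).
\end{aligned} \tag{60}$$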
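The chain of equalities ending in (60) is pure algebra, so it can be spot-checked by evaluating both ends at random complex points. The sketch below is illustrative only: the function names, the symbol `beta` (standing in for the belief value that appears in $\chi_2$), and the sample parameters are assumptions made for the example, and the threshold index `h` is assumed to be at least 2.

```python
import random


def chi2(lam, p2, beta, h):
    """chi_2(lambda) as defined after (59); beta stands for the belief value."""
    geo = sum((1 + lam) ** j for j in range(h - 2))  # 1 + ... + (1+lam)^(h-3)
    tail = (1 + lam) ** (h - 2) * (((1 - p2) + lam) * (2 + lam) + beta)
    return ((1 - p2) + (1 - beta) * lam) * geo + tail


def rhs_60(lam, p2, beta, h):
    """Final expression in (60)."""
    return -((1 - p2) + (1 - beta) * lam) + (1 + lam) ** h * (1 - p2 + lam)


def max_identity_error(p2, beta, h, trials=200, seed=0):
    """Largest |lam * chi2(lam) - rhs_60(lam)| over random complex samples."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        lam = complex(rng.uniform(-2, 1), rng.uniform(-1.5, 1.5))
        err = abs(lam * chi2(lam, p2, beta, h) - rhs_60(lam, p2, beta, h))
        worst = max(worst, err)
    return worst


# Hypothetical parameters: p2 = 0.8, beta = 0.5, h = 5.
print(max_identity_error(p2=0.8, beta=0.5, h=5))
```

The reported error should be at floating-point rounding level. Setting `lam = 0` in `chi2` also returns $(1-p_2)(h-2) + 2(1-p_2) + \beta$, the value of $\chi_2(0)$.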



It is clear from equation (59) that matrix $Q^2(z)$ has eigenvalue $-1$ with multiplicity $2\tau - h_2^*$. Let $\lambda$ be any eigenvalue of $Q^2(z)$; we first show the following lemma.

Lemma 22. Let $\lambda$ be any eigenvalue of $Q^2(z)$, then $-2 < \mathrm{Re}(\lambda) < 0$.

Proof: 1) Suppose $Q^2(z)$ has an eigenvalue of $0$; then, from (59), $\chi_2(0) = 0$. However,

$$\chi_2(0) = (1-p_2)(h_2^*-2) + 2(1-p_2) + b^2_{0,h_2^*} \neq 0,$$

leading to a contradiction. Hence $Q^2(z)$ does not have eigenvalue $0$.

2) Suppose the equation $\chi_2(\lambda) = 0$ has a root $\lambda^* = a + bi$ with $a > 0$, or $a \leq -2$, or being purely imaginary with $a = 0$, $b \neq 0$. Then, from equation (60),

$$(1-p_2) + (1-b^2_{0,h_2^*})\lambda^* = (1+\lambda^*)^{h_2^*}\,(1-p_2+\lambda^*). \tag{61}$$

Consider the modulus of the right-hand side,

$$\begin{aligned}
\big|(1+a+bi)^{h_2^*}\big| \cdot |1-p_2+a+bi| &\geq |1-p_2+a+bi| \\
&> \big|1-p_2+(1-b^2_{0,h_2^*})(a+bi)\big| \\
&= \big|1-p_2+(1-b^2_{0,h_2^*})\lambda^*\big|.
\end{aligned}$$

The above expression contradicts the previous equation (61).

From 1) and 2) we conclude that $\chi_2(\lambda) = 0$ can only have solutions with real part within $(-2, 0)$. Therefore all eigenvalues of matrix $Q^2(z)$ have real part within $(-2, 0)$. ■

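We proceed to show that each eigenvalue $\lambda$ of $Q^2(z)$ needs to satisfy $|\lambda + 1| < 1$. Suppose the equation $\chi_2(\lambda) = 0$ has a root $\lambda$ with $|\lambda + 1| \geq 1$; then from equation (60),

$$(1-p_2) + (1-b^2_{0,h_2^*})\lambda = (1+\lambda)^{h_2^*}\,(1-p_2+\lambda). \tag{62}$$

We let $1 + \lambda = x + yi$ where $x, y \in \mathbb{R}$; from the previous lemma we know that $|x| < 1$. Some derivation shows that

$$\begin{aligned}
&|1-p_2+\lambda|^2 - \big|(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big|^2 \\
&\quad = |1+\lambda|^2\,(2 - b^2_{0,h_2^*})\,b^2_{0,h_2^*} - 2x\,b^2_{0,h_2^*}\,(1 + p_2 - b^2_{0,h_2^*}) + b^2_{0,h_2^*}\,(2p_2 - b^2_{0,h_2^*}) \\
&\quad > |x|\,(2 - b^2_{0,h_2^*})\,b^2_{0,h_2^*} - 2|x|\,b^2_{0,h_2^*}\,(1 + p_2 - b^2_{0,h_2^*}) + |x|\,b^2_{0,h_2^*}\,(2p_2 - b^2_{0,h_2^*}) \\
&\quad = |x|\,b^2_{0,h_2^*}\Big[(2 - b^2_{0,h_2^*}) - 2(1 + p_2 - b^2_{0,h_2^*}) + (2p_2 - b^2_{0,h_2^*})\Big] \\
&\quad = 0,
\end{aligned}$$

where the first inequality is from the assumption that $|1 + \lambda| \geq 1$ and the fact that $|x| < 1$. Therefore

$$\big|(1-p_2+\lambda)(1+\lambda)^{h_2^*}\big| \geq |1-p_2+\lambda| > \big|(1-p_2) + (1-b^2_{0,h_2^*})\lambda\big|.$$

The above expression contradicts equation (62). Hence it cannot be that $|\lambda + 1| \geq 1$. Therefore, each eigenvalue $\lambda$ of $U^*$ satisfies $|\lambda + 1| < 1$, concluding the proof.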
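The squared-modulus expansion used in this last step is easy to get wrong by a sign, so a numerical cross-check may help. The sketch below is an illustration under stated assumptions, not part of the proof: the function names, `beta` (standing in for the belief value), and the sample parameters are hypothetical. Writing $1 + \lambda = x + yi$, it compares $|1-p_2+\lambda|^2 - |(1-p_2) + (1-\beta)\lambda|^2$ with the expanded form $|1+\lambda|^2(2-\beta)\beta - 2x\beta(1+p_2-\beta) + \beta(2p_2-\beta)$, and records the smallest value of the gap on samples with $|1+\lambda| \geq 1$ and $|x| < 1$.

```python
import math
import random


def modulus_gap(lam, p2, beta):
    """|1 - p2 + lam|^2 - |(1 - p2) + (1 - beta) lam|^2, computed directly."""
    return abs(1 - p2 + lam) ** 2 - abs((1 - p2) + (1 - beta) * lam) ** 2


def expanded_gap(lam, p2, beta):
    """The same quantity in expanded form, with x = Re(1 + lam)."""
    x = (1 + lam).real
    return (abs(1 + lam) ** 2 * (2 - beta) * beta
            - 2 * x * beta * (1 + p2 - beta)
            + beta * (2 * p2 - beta))


def check_samples(p2, beta, trials=500, seed=1):
    """Return (largest mismatch, smallest gap) over admissible samples."""
    rng = random.Random(seed)
    worst_err, min_gap = 0.0, float("inf")
    for _ in range(trials):
        x = rng.uniform(-0.99, 0.99)                     # |x| < 1
        y = math.sqrt(1 - x * x) + rng.uniform(0.0, 2.0)  # forces |1 + lam| >= 1
        lam = complex(x - 1, y)
        direct = modulus_gap(lam, p2, beta)
        worst_err = max(worst_err, abs(direct - expanded_gap(lam, p2, beta)))
        min_gap = min(min_gap, direct)
    return worst_err, min_gap


# Hypothetical parameters with beta < p2, as for a belief below p2.
print(check_samples(p2=0.8, beta=0.5))
```

On such samples the mismatch should be at rounding level and the gap strictly positive, which is the strict inequality the contradiction with (62) relies on.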

Appendix J 
Proof of Lemma 21

Proof: From the belief value evolution (1) we know

$$b^i_{0,l} = \frac{r_i - r_i(p_i - r_i)^l}{1 + r_i - p_i}, \qquad b^i_{0,l+1} - b^i_{0,l} = r_i(p_i - r_i)^l.$$

Therefore

$$\begin{aligned}
&(1-p_i) + b^i_{0,l} - (l-1)\big(b^i_{0,l+1} - b^i_{0,l}\big) \\
&\quad = (1-p_i) + \frac{r_i - r_i(p_i-r_i)^l}{1 + r_i - p_i} - (l-1)\,r_i(p_i-r_i)^l \\
&\quad = (1-p_i) + r_i\left[\frac{1 - (p_i-r_i)^l}{1 + r_i - p_i} - (l-1)(p_i-r_i)^l\right] \\
&\quad = (1-p_i) + r_i\,\frac{1 + (l-1)(p_i-r_i)^{l+1} - l\,(p_i-r_i)^l}{1 + r_i - p_i} \\
&\quad = (1-p_i) + r_i\,\frac{1 + (l-1)(p_i-r_i)^l(p_i-r_i-1) - (p_i-r_i)^l}{1 + r_i - p_i} \\
&\quad = (1-p_i) + r_i\,\frac{(l-1)(p_i-r_i)^l(p_i-r_i-1) - (p_i-r_i-1)\big(1 + (p_i-r_i) + \cdots + (p_i-r_i)^{l-1}\big)}{1 + r_i - p_i} \\
&\quad = (1-p_i) + r_i\Big[\big(1 + (p_i-r_i) + \cdots + (p_i-r_i)^{l-1}\big) - (l-1)(p_i-r_i)^l\Big].
\end{aligned} \tag{63}$$

Since $(p_i - r_i)^j > (p_i - r_i)^l$ for $j = 1, \cdots, l-1$, it follows from equation (63) that

$$(1-p_i) + b^i_{0,l} - (l-1)\big(b^i_{0,l+1} - b^i_{0,l}\big) \geq (1-p_i) + r_i > 0,$$

which proves the lemma.
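The closed form and the bound can be checked numerically for concrete parameters. The sketch below is a hedged illustration: the function names and parameter values are hypothetical, and the one-step recursion $b_{0,l+1} = b_{0,l}\,p + (1-b_{0,l})\,r$ with $b_{0,1} = r$ is an assumption about the form of the belief evolution (1), namely the usual two-state Markov chain update. It compares the recursion with the closed form, and the left-hand side of the lemma with expression (63).

```python
def belief_closed_form(l, p, r):
    """b_{0,l} = (r - r (p - r)^l) / (1 + r - p)."""
    return (r - r * (p - r) ** l) / (1 + r - p)


def belief_recursive(l, p, r):
    """Assumed one-step update b <- b p + (1 - b) r, starting from b_{0,1} = r."""
    b = r
    for _ in range(l - 1):
        b = b * p + (1 - b) * r
    return b


def lemma_lhs(l, p, r):
    """(1 - p) + b_{0,l} - (l - 1)(b_{0,l+1} - b_{0,l})."""
    b_l = belief_closed_form(l, p, r)
    b_next = belief_closed_form(l + 1, p, r)
    return (1 - p) + b_l - (l - 1) * (b_next - b_l)


def expression_63(l, p, r):
    """(1 - p) + r [ (1 + d + ... + d^(l-1)) - (l - 1) d^l ] with d = p - r."""
    d = p - r
    return (1 - p) + r * (sum(d ** j for j in range(l)) - (l - 1) * d ** l)


# Hypothetical parameters with p > r (positively correlated channel).
p, r = 0.8, 0.3
for l in (1, 2, 5, 10):
    print(l, lemma_lhs(l, p, r), expression_63(l, p, r), (1 - p) + r)
```

For each $l$ the two middle columns should agree up to rounding and should not fall below the last column, $(1-p) + r$.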



Appendix K 
Derivation of Index Values 

Here we derive the Whittle's indices according to definition (19), by studying the relationship between the threshold value and the subsidy value.

(Case 1) $\pi = b^k_{0,l} < b^k_s$. We let $V(\omega, b^k_{0,l})$ denote the reward-plus-subsidy for the $\omega$-subsidy problem when the threshold for activation is at $b^k_{0,l}$, i.e., the channel transmits when the belief is no smaller than $b^k_{0,l}$ and stays idle otherwise. Some algebra (of studying the steady state belief transition) shows that

$$V(\omega, b^k_{0,l}) = \frac{b^k_{0,l} + \omega(1-p_k)(l-1)}{b^k_{0,l} + (1-p_k)\,l}. \tag{64}$$

From the definition (19) of the Whittle's index value, it is equally optimal to activate or idle the channel with the belief value $b^k_{0,l}$ at the subsidy value $W_k(b^k_{0,l})$. From thresholdability, the belief value $b^k_{0,l}$ is at the boundary of the idle set $\mathcal{I}^k(W_k(b^k_{0,l}))$. Therefore the reward obtained by setting the threshold for activation at $b^k_{0,l}$ equals that with threshold $b^k_{0,l+1}$, i.e.,

$$V\big(W_k(b^k_{0,l}),\, b^k_{0,l}\big) = V\big(W_k(b^k_{0,l}),\, b^k_{0,l+1}\big),$$

where $V(W_k(b^k_{0,l}), b^k_{0,l})$ represents the reward when the action taken at belief $b^k_{0,l}$ under subsidy $W_k(b^k_{0,l})$ is to activate (action $1$), and $V(W_k(b^k_{0,l}), b^k_{0,l+1})$ represents the reward when the action taken at belief $b^k_{0,l}$ is to idle (action $0$).

Substituting expression (64) into the previous relationship leads to the expression of the Whittle's index value,

$$W_k(b^k_{0,l}) = \frac{(b^k_{0,l} - b^k_{0,l+1})(l+1) + b^k_{0,l+1}}{1 - p_k + (b^k_{0,l} - b^k_{0,l+1})\,l + b^k_{0,l+1}}, \tag{65}$$

which is the same as in (14).
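The indifference condition behind (65) can be verified in exact rational arithmetic. The sketch below is a minimal illustration under stated assumptions: the function names and the parameter values $p = 4/5$, $r = 3/10$ are hypothetical, and the belief values use the closed form $b_{0,l} = r(1-(p-r)^l)/(1+r-p)$ from Appendix J. It evaluates the average reward (64) at thresholds $l$ and $l+1$ under the subsidy given by (65) and checks that the two coincide.

```python
from fractions import Fraction


def belief(l, p, r):
    """Closed-form belief b_{0,l} = r (1 - (p - r)^l) / (1 + r - p)."""
    return r * (1 - (p - r) ** l) / (1 + r - p)


def avg_reward(w, l, p, r):
    """Reward-plus-subsidy (64) with activation threshold b_{0,l} and subsidy w."""
    b = belief(l, p, r)
    return (b + w * (1 - p) * (l - 1)) / (b + (1 - p) * l)


def whittle_index(l, p, r):
    """Index value (65) at belief b_{0,l}."""
    b_l, b_next = belief(l, p, r), belief(l + 1, p, r)
    return ((b_l - b_next) * (l + 1) + b_next) / (1 - p + (b_l - b_next) * l + b_next)


# Hypothetical parameters, kept as exact fractions so equality is exact.
P, R = Fraction(4, 5), Fraction(3, 10)
for l in range(1, 5):
    w = whittle_index(l, P, R)
    print(l, w, avg_reward(w, l, P, R) == avg_reward(w, l + 1, P, R))
```

Using `Fraction` avoids floating-point tolerance questions: at the subsidy from (65), the rewards for thresholds $b_{0,l}$ and $b_{0,l+1}$ are equal as rational numbers.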

(Case 2) $\pi \geq b^k_s$. In this case, we first present the following claim. This claim states that, if the threshold for activation is above $b^k_s$, then it is optimal to always stay idle.

Claim 1. If $\theta_k(\omega) \geq b^k_s$ then $\mathcal{I}^k(\omega) = \mathcal{B}^k$.

This claim is indeed true because, from Lemma 1, if $\theta_k(\omega) \geq b^k_s$, then eventually all users will be idle, hence it is optimal to always stay idle. Hence, for all belief states $\pi \geq b^k_s$, their Whittle's index value, according to the definition, equals the infimum subsidy value for which the channel always stays idle. Note that $W_k(\pi)$ monotonically increases with $\pi$ for $\pi < b^k_s$; therefore,

$$W_k(\pi) = \lim_{l \to \infty} W_k(b^k_{0,l}).$$

From (65) we get

$$W_k(\pi) = \frac{r_k}{(1-p_k)(1+r_k-p_k) + r_k}$$



